Last updated: 2021-01-15
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version c674a51. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: data/VR_20051125.txt.xz
Ignored: output/ent_cln.fst
Ignored: output/ent_raw.fst
Ignored: renv/library/
Ignored: renv/staging/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/notes.Rmd) and HTML (docs/notes.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | c674a51 | Ross Gayler | 2021-01-15 | Add 01-6 clean vars |
| html | 0b67848 | Ross Gayler | 2021-01-06 | Build site. |
| Rmd | ec38ffc | Ross Gayler | 2021-01-06 | `wflow_publish("analysis/no*.Rmd")` |
| html | 03ad324 | Ross Gayler | 2021-01-05 | Build site. |
| Rmd | b03a2c1 | Ross Gayler | 2021-01-05 | `wflow_publish(c("analysis/index.Rmd", "analysis/notes.Rmd"))` |
This document is for keeping notes on any points that may be useful for later project or manuscript development and that are either not covered in the analysis notebooks or at risk of getting lost in them.
Get a sizeable publicly available data set with personal names (NCVR).
Use sex and age in addition to personal names so that most records are discriminable.
High frequency names will likely not be discriminable with only these attributes.
Age (and possibly sex) will be used as a blocking variable.
Age and sex are also of interest in the calculation of name frequency because name distributions should vary conditional on age and sex.
Keep address and phone number as they may be useful for manually checking identity in otherwise nondiscriminable records.
Get the oldest available data to minimise its currency (NCVR 2005 snapshot).
Drop objectionable attributes such as race and political affiliation.
Apply basic data cleaning to the predictive attributes.
This is probably unnecessary given how the data will be used.
I can’t bring myself to model data without scrutinising it first.
Only keep records that are ACTIVE and VERIFIED for modelling.
These are likely to have the highest data quality attributes.
These are least likely to have duplicate records (i.e. referring to the same person).
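The ACTIVE-and-VERIFIED filter above can be sketched in base R. The column names and code values here (`status_cd`, `reason_cd`, `"A"`, `"AV"`) are assumptions for illustration; the real NCVR snapshot columns and codes should be checked against the data dictionary.

```r
# Toy records standing in for the NCVR snapshot.
# status_cd and reason_cd are assumed column names/codes for illustration.
ent <- data.frame(
  id        = 1:4,
  status_cd = c("A", "A", "I", "A"),   # "A" = active (assumed)
  reason_cd = c("AV", "AN", "AV", "AV") # "AV" = active, verified (assumed)
)

# Keep only records flagged as both ACTIVE and VERIFIED for modelling
ent_model <- ent[ent$status_cd == "A" & ent$reason_cd == "AV", ]
nrow(ent_model)
```

Filtering before modelling keeps the duplicate-prone and low-quality records out of both the dictionary and the query sets.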
Look at frequency distributions of names conditional on name length. The Zipf distributions may have different shape parameters for different name lengths. Name length might be examined as an alternative to name frequency for interaction with similarity.
Look at frequency distributions of names conditional on age and/or sex
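The length-conditional frequency idea can be sketched in base R: tabulate name frequencies, stratify by name length, and rank within each stratum. A Zipf-style check would then plot log frequency against log rank separately per length. The toy name vector is purely illustrative.

```r
# Toy first-name sample; the real analysis would use the NCVR name columns
names_vec <- c("ann", "ann", "bob", "kim", "maria", "maria", "maria", "james")

freq <- as.data.frame(table(name = names_vec), stringsAsFactors = FALSE)
freq$len <- nchar(freq$name)

# Rank names by frequency within each name-length stratum;
# log(Freq) vs log(rank) per value of len gives one Zipf curve per length
freq <- freq[order(freq$len, -freq$Freq), ]
freq$rank <- ave(freq$Freq, freq$len,
                 FUN = function(x) rank(-x, ties.method = "first"))
freq
```

If the fitted shape parameters differ systematically across `len`, that supports treating name length as a predictor in its own right.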
Use blocking to reduce the number of comparisons for practicality
Blocking on age will give more homogeneity of first names within blocks because of name popularity varying over time.
Blocking on county will give more homogeneity of last names within blocks because of families living together.
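A minimal blocking sketch, assuming `age` and `county` variables: records are assigned a block key and only pairs sharing a key are compared, reducing comparisons from choose(n, 2) overall to the sum of choose(n_block, 2) within blocks. The 5-year age band is an arbitrary illustrative choice.

```r
# Toy records; age and county stand in for the real blocking variables
recs <- data.frame(
  id     = 1:6,
  age    = c(23, 25, 24, 61, 60, 62),
  county = c("WAKE", "WAKE", "DURHAM", "WAKE", "DURHAM", "DURHAM")
)

# Block key: county crossed with a 5-year age band (assumed band width)
recs$block <- paste(recs$county, recs$age %/% 5, sep = "_")

# Candidate record groups: comparisons happen only within each block
blocks <- split(recs$id, recs$block)
blocks
```

Age bands rather than exact age would tolerate small reporting errors at the block boundary, at the cost of slightly larger blocks.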
Try indicators for missingness. Missingness may be differentially informative across different predictor variables.
Try indicators for similarity == 1. The compatibility of exact string equality is not necessarily continuous with the compatibility of similarity just below 1.
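The two indicator ideas above can be sketched as feature construction in base R: a separate dummy for exact equality, a dummy for missingness, and a value term carrying the similarity only when it is strictly between 0 and 1. The similarity scores here are made up; in practice they would come from some string comparator such as Jaro-Winkler.

```r
# Assumed similarity scores from a string comparator; NA = value missing
sim <- c(1.00, 0.97, 0.80, 1.00, NA)

# Split the score into three terms so the model is not forced to treat
# exact equality as continuous with similarity just below 1, nor
# missingness as just another score
feat <- data.frame(
  sim_exact   = as.integer(!is.na(sim) & sim == 1),  # exact-match indicator
  sim_missing = as.integer(is.na(sim)),              # missingness indicator
  sim_value   = ifelse(is.na(sim) | sim == 1, 0, sim) # value term otherwise
)
feat
```

With this coding the coefficient on `sim_exact` is free to jump away from whatever the continuous `sim_value` term predicts near 1.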
Try name frequency as an interactive predictor variable.
There are two names in each lookup: dictionary and query. Therefore there are also two name frequencies to be considered. Consider how to use both frequencies (e.g. min, max, geometric mean, …).
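The candidate ways of combining the dictionary-side and query-side frequencies can be sketched directly; the frequency values are invented for illustration, and which combination works best is an empirical question.

```r
# Assumed relative frequencies of the same name in two sources
f_dict  <- 0.004  # frequency of the name in the dictionary
f_query <- 0.001  # frequency of the name in the query stream

# Candidate symmetric combinations of the two frequencies
combos <- c(
  f_min   = min(f_dict, f_query),
  f_max   = max(f_dict, f_query),
  f_gmean = sqrt(f_dict * f_query)  # geometric mean
)
combos
```

The geometric mean is a natural first candidate because name frequencies are roughly log-distributed, so averaging on the log scale is less dominated by the larger frequency.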
Queries may contain names that do not exist in the dictionary, so we need to deal with that case.
Do we need to apply frequency smoothing, as used in probabilistic linguistic models?
Do we need to estimate the probability mass of unobserved names?
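As a concrete starting point for both questions, here is an add-one (Laplace) smoothing sketch: it moves a little probability mass off observed names so that unseen names get nonzero probability. The counts and the vocabulary size `V` are invented; estimating `V` (observed plus unobserved names) is itself part of the problem, and more refined schemes such as Good-Turing exist.

```r
# Observed name counts in the dictionary (toy values)
counts <- c(mary = 40, john = 35, zoe = 5)
V <- 10          # assumed total vocabulary size, observed + unobserved
N <- sum(counts) # total observations

p_seen   <- (counts + 1) / (N + V)  # smoothed probability of observed names
p_unseen <- 1 / (N + V)             # probability of any single unseen name

# The probability mass accounts for all V names, so it sums to 1
total <- sum(p_seen) + (V - length(counts)) * p_unseen
total
```

A query name absent from the dictionary would then be assigned `p_unseen` rather than an impossible zero frequency.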
In general, the dictionary will be a subset of the entities in the universe of queries. Consider the impact of this on modelling as the fraction of the query universe in the dictionary varies.
Partition the records into a dictionary and a set of queries (the QU set).
Select a subset of the dictionary records to use as the QM query set.
This evaluation differs from the usual evaluation of entity resolution in that it doesn’t consider the impact of transcription/typographical variation in the queries.
If we are interested in the performance with respect to transcription/typographical variation we may need to consider artificially corrupting some of the queries.
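A minimal typo-injection sketch of such corruption, substituting one random character in a name; a real study would use a purpose-built corruptor (e.g. the GeCO-style tools) with realistic error models rather than this uniform substitution.

```r
# Replace one randomly chosen character with a random lowercase letter.
# The seed argument is only to make the illustration repeatable; note the
# substituted letter can coincide with the original, leaving the name unchanged.
corrupt_name <- function(x, seed = 1) {
  set.seed(seed)
  pos <- sample(nchar(x), 1)
  substr(x, pos, pos) <- sample(letters, 1)
  x
}

corrupt_name("gayler")
```

Applying this to a sampled subset of the QM queries would let the calibration be evaluated under a controlled rate of typographical variation.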