Last updated: 2021-02-26

Checks: 7 passed, 0 failed

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version fbd4d79. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/blk_char.fst
    Ignored:    output/ent_blk.fst
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  analysis/workflow.Rmd

Unstaged changes:
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/notes.Rmd) and HTML (docs/notes.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd fbd4d79 Ross Gayler 2021-02-26 Add sampling section to notes.Rmd
html c421bb5 Ross Gayler 2021-02-19 Build site.
Rmd 815af1e Ross Gayler 2021-02-19 Add sampling to notes
Rmd a2cde2d Ross Gayler 2021-02-17 end of day
Rmd a093655 Ross Gayler 2021-02-16 end of day
Rmd 0935c13 Ross Gayler 2021-02-10 end of day
html 0d30c5b Ross Gayler 2021-01-26 Build site.
html 23d740f Ross Gayler 2021-01-24 Build site.
Rmd d0307f5 Ross Gayler 2021-01-24 Add 02-1 char block vars
Rmd 6df8db7 Ross Gayler 2021-01-24 End of day
Rmd 3090bea Ross Gayler 2021-01-23 End of day
Rmd 94b7b20 Ross Gayler 2021-01-19 End of day
html 0c48ca5 Ross Gayler 2021-01-17 Build site.
Rmd 3052780 Ross Gayler 2021-01-17 Add 02-1 block vars
html 3eb6e3b Ross Gayler 2021-01-16 Build site.
Rmd e6721d3 Ross Gayler 2021-01-16 Add blocking to notes.Rmd
html 5ab5dc4 Ross Gayler 2021-01-15 Build site.
Rmd c674a51 Ross Gayler 2021-01-15 Add 01-6 clean vars
html 0b67848 Ross Gayler 2021-01-06 Build site.
Rmd ec38ffc Ross Gayler 2021-01-06 wflow_publish("analysis/no*.Rmd")
html 03ad324 Ross Gayler 2021-01-05 Build site.
Rmd b03a2c1 Ross Gayler 2021-01-05 wflow_publish(c("analysis/index.Rmd", "analysis/notes.Rmd"))

This document keeps notes of any points that may be useful for later project or manuscript development and that are either not covered in the analysis notebooks or at risk of getting lost in them.

1 Entity data

This section refers to data about the entities to be resolved. I have qualified ‘data’ with ‘entity’ on the grounds that we might possibly use some other data (not directly about entities) in the project (e.g. data about the frequencies of names in the population).

  • Get a sizeable publicly available data set with personal names (North Carolina Voter Registration).

    • The focus of the empirical work is on string similarity metrics of names and the frequencies of the names.
  • Use sex and age in addition to personal names so that most records are discriminable.

    • High frequency names will likely not be discriminable with only these attributes.

      • This is not a problem because we are really interested in whether the methods proposed here assist in quantifying the discriminability of records. We want records spanning a wide range of discriminability.
    • Age (and possibly other variables, such as sex) will be used as blocking variables.

      • Blocking is probably needed to make the project computationally tractable.
    • Age and sex are also of interest in the calculation of name frequency because name distributions should vary conditional on age and sex.

  • Keep address and phone number as they may be useful for manually checking identity in otherwise nondiscriminable records.

    • As a fallback position, address and phone number can be used as discriminating attributes in the compatibility model.
  • Get the oldest available data (NCVR 2005 snapshot) to minimise its currency and the associated privacy risk.

  • Drop objectionable NCVR attributes such as race and political affiliation.

  • A note about the irregularity of names: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/

2 Entity data preparation

  • Apply basic data cleaning to the predictive attributes.

    • This is probably unnecessary given how the data will be used.

    • I can’t bring myself to model data without scrutinising it first.

  • Only keep records that are ACTIVE and VERIFIED for modelling (a filtering sketch follows this list).

    • These are likely to have the highest data quality.

    • These are least likely to have duplicate records (i.e. be referring to the same person).
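
As a concrete illustration, a minimal dplyr sketch of this filtering step; the column names (status, reason) and code values are assumptions, not the actual NCVR field names.

    library(dplyr)

    # Toy stand-in for the raw entity data; column names are assumptions.
    ent_raw <- tibble(
      id     = 1:4,
      status = c("ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE"),
      reason = c("VERIFIED", "UNVERIFIED", "VERIFIED", "VERIFIED")
    )

    # Keep only the records most likely to be high quality and unduplicated.
    ent_mod <- ent_raw %>%
      filter(status == "ACTIVE", reason == "VERIFIED")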

3 Blocking

  • Use blocking to reduce the number of comparisons, to keep the computational cost feasible.

  • This project is not an operational system and we are only using blocking to reduce the computational cost, so we can choose blocking that would not be acceptable for an operational system.

  • Where the dictionary blocks vary widely in size we might choose to work with only a subset of blocks that are a suitable size.

  • If we think that some aspect of the compatibility modelling might vary as a function of block size we would probably want to test this over a wide range of block sizes.

  • We will probably be repeating analyses over some number of blocks to assess the variability of results, but that might only be a subset of blocks with no commitment to examine all the blocks.

  • Blocking variables may have missing values in the query and dictionary records.

    • This can be handled but is fiddly and not the focus of this project.
    • Handling missing predictor values in regression-based compatibility models is simple.
  • Exclude from use all records with missing values on the blocking variables, so that missing values don’t have to be allowed for in blocking.

  • Try to choose blocking variables with a small proportion of missing values, so as to minimise systematic bias due to their exclusion.

  • Construct a few potentially useful blocking variables (see the sketch after this list).

  • Blocking can induce changes in the distributions of names.

    • Blocking on sex (in combination with other variables) will definitely give more within-block homogeneity of first names because of gendered names.

    • Blocking on age will give more homogeneity of first names within blocks because of name popularity varying over time.

    • Blocking on registration county may give more homogeneity of last names within blocks because of families living together.
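
A minimal sketch of constructing a composite blocking key from sex, banded age, and county, after excluding records with missing blocking values. The column names and the ten-year age banding are assumptions.

    library(dplyr)

    # Toy entity records; column names and values are assumptions.
    ent <- tibble(
      id     = 1:5,
      sex    = c("F", "M", "F", "F", "M"),
      age    = c(23, 67, 41, 23, 35),
      county = c("WAKE", "WAKE", "DURHAM", "WAKE", "DURHAM")
    )

    ent_blk <- ent %>%
      filter(!is.na(sex), !is.na(age), !is.na(county)) %>%   # exclude missing blocking values
      mutate(
        age_band = cut(age, breaks = seq(15, 105, by = 10)), # coarsen age to ten-year bands
        blk_key  = paste(sex, age_band, county, sep = "|")   # composite blocking key
      )

    # Block sizes vary widely; inspect them before choosing which blocks to analyse.
    count(ent_blk, blk_key, sort = TRUE)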

4 Entity data characterisation

These sections are about looking at the properties of the entity data that will be most relevant to their use in the compatibility models.

4.1 Structure induced by equality

  • Prior work on name frequency has (implicitly) only considered equality of names. That is, two name tokens are either absolutely identical or absolutely different. (No gradations of difference are entertained.) Counting the frequency of names is finding the cardinality of each set of identical name tokens.

    • Name frequency will be used as a predictor in a compatibility model, \(compat(q, d_i)\).
    • For a query record \(q\) the compatibility will be estimated for every dictionary record \(d_i\) in the block \(B_q\) (the set of dictionary records in the block selected by the query).
    • The estimated compatibility will vary over \(d_i\), so we expect the predictors to be functions of \(d_i\) (in the context of the set of dictionary records \(B_q\) selected by the query record).
    • The (first attempt at) name equality frequency \(f_{eq}(q, d_i)\) is defined as \(\vert \{ d_j : d_j \in B_q \land name(d_j) = name(d_i)\} \vert\) (see the sketch after this list).
  • Look at frequency distributions of names conditional on name length. The Zipf distributions may have different shape parameters for different name lengths. Name length might be examined as an alternative to name frequency for interaction with similarity.

  • Look at frequency distributions of names conditional on age and/or sex.

    • These conditional distributions may increase the predictive power of the compatibility model.
  • Look at frequency distributions of names conditional on blocking variables.

    • This is to get an understanding of the effect of blocking on name frequency distributions.
    • In particular, look at any effects of block size (which has a very wide range).
    • The anticipated usage is that name frequency will be calculated within the dictionary block selected by the query record. (The block can be construed as the only dictionary that matters for the purposes of the query.)
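
A minimal sketch of the equality frequency \(f_{eq}\) computed within a block: each dictionary record’s value is just the count of block records sharing its name exactly. The data and column names are toy assumptions.

    library(dplyr)

    # Toy dictionary records in the block B_q selected by a query.
    blk <- tibble(
      id   = 1:6,
      name = c("SMITH", "SMITH", "SMYTH", "JONES", "SMITH", "JONES")
    )

    # f_eq(q, d_i): the number of block records whose name equals name(d_i).
    blk <- blk %>%
      add_count(name, name = "f_eq")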

4.2 Structure induced by similarity

  • The similarity version of name frequency is an extension of the equality version. It counts the number of dictionary records in the block that are at least as similar to the query record as the currently considered dictionary record.

    • The (first attempt at) name similarity frequency \(f_{sim}(q, d_i)\) is defined as \(\vert \{ d_j : d_j \in B_q \land sim(name(q), name(d_j)) \ge sim(name(q), name(d_i))\} \vert\) (see the sketch after this list).
  • Look at similarity frequency distributions of names. It’s not obvious that these should be Zipf distributions. For example, the rare names might be quite similar to more frequent names, which might obscure the long tail of the underlying distribution of names.

  • Look at similarity frequency distributions of names conditional on name length. The Zipf distributions may have different shape parameters for different name lengths.

    • This is of interest because similarity is usually scaled to be between 1 (equality) and 0 (completely different) regardless of the string length. The longer the strings, the greater the size of the space of possible strings and the higher the dimensionality of the space. It’s not obvious to me that equality (inequality) of very short strings carries the same evidential value as equality (inequality) of long strings.
  • Look at similarity frequency distributions of names conditional on age and/or sex.

    • These conditional distributions may increase the predictive power of the compatibility model.
  • Look at similarity frequency distributions of names conditional on blocking variables.

    • This is to get an understanding of the effect of blocking on name frequency distributions.
    • In particular, look at any effects of block size (which has a very wide range).
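
A minimal sketch of the similarity frequency \(f_{sim}\) within a block, using Jaro-Winkler similarity from the stringdist package as an illustrative (assumed) choice of similarity metric.

    library(dplyr)
    library(stringdist)

    q_name <- "SMITH" # the query record's name

    # Toy dictionary records in the block B_q selected by the query.
    blk <- tibble(
      id   = 1:6,
      name = c("SMITH", "SMITH", "SMYTH", "JONES", "SMITH", "JONES")
    )

    blk <- blk %>%
      mutate(
        # Similarity of each dictionary name to the query name (1 = equal).
        sim   = 1 - stringdist(q_name, name, method = "jw"),
        # f_sim(q, d_i): block records at least as similar to the query as d_i.
        f_sim = vapply(sim, function(s) sum(sim >= s), integer(1))
      )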

5 Sampling

  • Both modelling and performance evaluation will be based on random samples of the entity data, so sort out the sampling first.

  • Modelling is necessarily based on query/dictionary record pairs. That is, the record pair is the unit of analysis for modelling.

  • Performance assessment of model calibration will also be at the record pair level.

  • Performance assessment of overall model performance at identifying the correctly matching dictionary record (e.g. AUC or F-score) is at the query record level.

    • The query-level response is calculated by a computation across all the dictionary/query record pairs induced by the query record.
    • Assessment of query-level performance requires that the query records be an appropriately defined random sample of the entity data records.
    • The query records should be a uniform random sample of the entity data records, because to do otherwise would be equivalent to assuming that the queries are a systematically biased subset of the universe of entities; that is entirely plausible, but it is a whole research topic of its own which we will not address in this project.
  • The query/dictionary record pairs are generated as combinations of a set of queries with a fixed dictionary. The record pairs can’t be uniformly sampled from the set of all possible record pairs because that wouldn’t respect the dictionary definition.

    • We need to create dictionaries by some sampling process; create a set of queries by some sampling process; then those two sets of records will jointly determine the query/dictionary record pairs that are in-sample.
  • We need some sort of training/testing partition to keep performance assessment honest.

  • Remember that queries are run against a dictionary, so in principle, it would be possible to have separate train and test queries run against the same fixed dictionary.

    • I am inclined to think that running training/testing queries against the same dictionary is probably OK because in a practical application the modelling would be done with respect to the dictionary as it existed at some point in time and then queries applied against essentially the same dictionary. (Given a large dictionary, subsequent queries would update it relatively slowly.)

    • On the other hand, I am inclined to think that using separate dictionaries for training/test would not cause a problem, because the proposed models are not intended to rely on the actual names in the dictionary, but on properties of the names (like frequency and length), which are likely to be very similar for different dictionaries based on random samples of the universe of names.

    • On balance, I think I will use different dictionaries for train/test because it is probably harmless, probably conservative with respect to performance assessment, and probably easy to implement.

  • Dictionaries are constructed from entity data records, so having separate train/test dictionaries implies partitioning the entity data records prior to constructing the dictionaries.

    • The easiest way to partition the records is to partition independently at the record level.

      • Note that records containing common names will occur in both data sets (train and test).
    • Partitioning the entity records so that each name type occurred in only one of the train or test data sets would be the most conservative in terms of separation of the train and test data sets.

      • This would be easy if there was only one name field. However, we have three name fields (first, middle, last). Partitioning the data so that all the names occurred in only one data set would drastically reduce the number of records able to be used and bias the data towards names that only occur once (because it is impossible for those names to occur in two data sets).
  • Partition the entity data records uniformly and independently into training and testing data sets.

    • Make the training and testing sets of equal size so that the dictionary statistics are similar.
    • The construction of subsamples (queries, and dictionaries of different sizes) of the training set will be paralleled in the construction of subsamples of the testing set.
  • We will want to repeat the entire analytic process multiple times on different samples in order to assess the variability of the results.

    • Repeat the partitioning into training and test sets for as many replicates as are needed.

    • The number of replicates is yet to be determined.

      • It needs to be large enough to give a reasonable view of variability of the results.

      • It needs to be not so large as to be too computationally expensive.

  • The dictionary will be a randomly sampled subset of the test or train universe, because it is usual in reality for the dictionary to be a subset of the universe of entities.

  • The dictionary sampling will be repeated at a range of sampling rates. This will allow us to investigate calibration of the model as the fraction of the universe present in the dictionary varies.

    • It is expected that this can be captured as a change of value of the intercept term of the logistic regression.

    • In practice, we usually don’t know the size of the universe relative to the dictionary.

    • If this effect is captured as an intercept term it will be simple to adjust the model to reflect the assumed size of the universe.

  • The queries are selected as a random subset of the relevant universe.

    • Use the same queries across all the corresponding dictionaries.

      • This is marginally less effort than choosing different queries per dictionary.

      • This should result in the evaluations across dictionaries being somewhat more similar than if the queries were chosen separately for each dictionary, because the “difficulty” of the queries is fixed across dictionaries.

      • The fraction of queries having matches in the dictionary will (obviously) vary with the sampling rate that created the dictionary.

    • Sample some fixed number of queries (yet to be determined).

      • The number of queries needs to be large enough to reasonably cover the blocks in the dictionary, cover the range of the predictors used in the model, and provide a reasonable observations-to-parameters ratio for estimating the model.

      • The number of queries needs to be not so large as to be infeasibly expensive to compute the models.

The entire sampling process is summarised in the following diagram.

The square nodes represent sets of entity data records.

The triangular nodes represent random sampling processes.

The text-only nodes represent replicates of random sampling nodes (just made less visually intrusive, for clarity).

The top data node represents the complete data set of entity data records.

The “train/test 50:50 splitter 1” node randomly partitions the total data into two data sets of (approximately) equal size: the “train universe” and the “test universe”.

The other “train/test 50:50 splitter” nodes represent the replication of the train/test split as many times as are needed.

The left-most “query sampler: n_q” node randomly samples \(n_q\) records from the “train universe” to be used as the “train queries” to be run against all of the dictionaries sampled from the “train universe”.

The left-most “dictionary sampler: 90%” node randomly selects 90% of the “train universe” records to use as a dictionary.

The remaining “dictionary sampler” nodes create more dictionaries with different selection rates.

The one query data set is applied to all the corresponding dictionary data sets.

The construction of the query and dictionary data sets from the “test universe” exactly parallels the construction from the “train universe”.

The construction of the data sets under each of the other “train/test 50:50 splitter” nodes exactly parallels the construction under the “train/test 50:50 splitter 1” node.
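
The same sampling process as a minimal R sketch; the record count, the number of queries \(n_q\), and the set of dictionary sampling rates are illustrative assumptions.

    set.seed(20201104)

    universe <- data.frame(id = 1:10000) # stand-in for the cleaned entity records

    # Train/test 50:50 splitter: uniform and independent at the record level.
    in_train       <- runif(nrow(universe)) < 0.5
    train_universe <- universe[in_train, , drop = FALSE]
    test_universe  <- universe[!in_train, , drop = FALSE]

    # Query sampler: n_q records drawn uniformly from the train universe.
    n_q           <- 100
    train_queries <- train_universe[sample(nrow(train_universe), n_q), , drop = FALSE]

    # Dictionary samplers: one dictionary per sampling rate, each a uniform
    # random subset of the train universe; the same queries run against all.
    rates <- c(0.9, 0.7, 0.5, 0.3, 0.1)
    train_dictionaries <- lapply(rates, function(r) {
      train_universe[runif(nrow(train_universe)) < r, , drop = FALSE]
    })

    # The test universe gets the identical treatment, and replicates repeat
    # the whole process from the 50:50 split downwards.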

6 Modelling

  • Try indicators for missingness. Missingness may be differentially informative across different predictor variables.

  • Try indicators for similarity == 1. The compatibility of exact string equality is not necessarily continuous with the compatibility of similarity just below 1.

  • Try name frequency as an interactive predictor variable (interacting with similarity); a model sketch follows this list.

    • Also consider frequency conditional on age and/or sex
  • There are two names in each lookup: dictionary and query. Therefore there are also two name frequencies to be considered.

    • We are interested in calculating the probability of a correct match for each query, so the sought probability is conditional on the query. Consequently the conditional probability of a correct match should not depend on the frequency of the query name.

    • I am not entirely convinced by that argument, so consider how to use both frequencies (e.g. min, max, geometric mean, …).

    • Queries may contain names that do not exist in the dictionary, so we need to deal with that case.

    • Do we need to apply frequency smoothing, as used in probabilistic linguistic models?

    • Do we need to estimate the probability mass of unobserved names?

  • In general, the dictionary will be a subset of the entities in the universe of queries. Consider the impact of this on modelling as the fraction of the query universe in the dictionary varies.

    • Can the sampling fraction be modelled as a prior probability (effectively, a change in the intercept term of a logistic regression)?

      • This can be investigated by varying the fraction of dictionary records randomly sampled from the universe and treating the sampling fraction as a predictor in the model. The fitted coefficients for the sampling fraction can be compared to what would be expected treating the same sampling probabilities as a prior for the logistic regression.

      • Consider whether this varies on a per query basis because of blocking. That is, is there effectively a separate dictionary per blocking value?

    • It is feasible to use the “same” variables for blocking and as (components of) predictors. The blocks are based on the values of the dictionary records and selected by the value in the query record. The predictor variables are properties of record pairs, so there will still be within-block variance of the predictor even when there is no within-block variance of properties of the query record. That is, they’re not the “same” variable when they are used for blocking and as a predictor.
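
A minimal sketch of the kind of logistic regression these points suggest. Every variable name is an assumption, the toy data exists only so the sketch runs end to end, and the offset is just one candidate way to test whether the dictionary sampling fraction behaves as an intercept shift.

    # Toy record-pair data; all names and distributions are assumptions.
    set.seed(42)
    n <- 500
    pairs <- data.frame(
      sim_name        = runif(n),                            # name similarity in [0, 1]
      f_sim           = sample(1:50, n, replace = TRUE),     # similarity-induced name frequency
      name_missing    = rbinom(n, 1, 0.05),                  # indicator: a name value is missing
      log_sample_rate = log(sample(c(0.9, 0.5, 0.1), n, replace = TRUE)) # dictionary sampling rate
    )
    pairs$sim_name[sample(n, 25)] <- 1                       # force some exact name matches
    pairs$sim_is_exact <- as.integer(pairs$sim_name == 1)    # indicator: similarity == 1
    pairs$is_match     <- rbinom(n, 1, plogis(-3 + 4 * pairs$sim_name))

    fit <- glm(
      is_match ~ sim_name * log(f_sim) # similarity interacting with (log) name frequency
        + sim_is_exact                 # exact equality treated as its own effect
        + name_missing                 # missingness indicator
        + offset(log_sample_rate),     # candidate intercept shift for the sampling fraction
      family = binomial(),
      data   = pairs
    )
    summary(fit)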

7 Performance evaluation

  • Performance evaluation should always be performed on a disjoint set of records from those used in estimation of the model.

  • The query records are a random sample from the relevant universe of entity records.

  • The dictionary records are an independent random sample from the same universe of entity records.

  • Consequently some of the query records will be absent from the dictionary (the \(Q_U\) set of query records); see the sketch after this list.

    • By the (reasonable) assumption that there are no duplicate records, each of the \(Q_U\) queries will be unmatched in the dictionary.
  • The rest of the query records will be present in the dictionary (the \(Q_M\) set of query records).

    • By the (reasonable) assumption that there are no duplicate records, each of the \(Q_M\) queries will have exactly one matching record in the dictionary.
  • This evaluation differs from the usual evaluation of entity resolution in that it doesn’t consider the impact of transcription/typographical variation in the queries.

    • It looks at the quantification of discriminability when the available attributes are not necessarily able to ensure that all records are discriminable.
  • If we are interested in the performance with respect to transcription/typographical variation we may need to consider artificially corrupting some of the queries.

    • Consider assigning some of the dictionary records to randomly chosen wrong blocks. (Is this equivalent to randomly selecting some out-of-block query records to run against each dictionary block?)
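
A minimal sketch of deriving the query-level response from scored record pairs; the tiny data set and column names are assumptions. Query 1 stands in for \(Q_M\), and queries 2 and 3 for \(Q_U\).

    library(dplyr)

    # Toy scored pairs: one row per query/dictionary record pair.
    scored <- tibble(
      query_id = c(1, 1, 1, 2, 2, 3),
      dict_id  = c(10, 11, 12, 10, 13, 14),
      p_match  = c(0.90, 0.05, 0.02, 0.40, 0.35, 0.01), # predicted match probability
      is_match = c(1, 0, 0, 0, 0, 0)                    # truth: only query 1 has a match
    )

    # Query-level response: a computation across all pairs induced by the query.
    query_level <- scored %>%
      group_by(query_id) %>%
      summarise(
        best_p    = max(p_match),                   # score of the top-ranked candidate
        top_is_tp = is_match[which.max(p_match)],   # is the top candidate the true match?
        in_Q_M    = any(is_match == 1)              # TRUE for Q_M queries, FALSE for Q_U
      )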

8 Writing/theory

  • Note the relationship to Fellegi & Sunter / probabilistic record linkage (in proposal?).

8.1 Quantity to be modelled

  • The unit of analysis is a dictionary/query record pair.

  • The logistic regression model is modelling the probability that the dictionary record refers to the same entity as the query record.

    • That is, the record-pair modelling yields an unconditional probability that does not depend on how that predicted probability will be used (e.g. compared to a threshold).

    • However, the logic behind the name frequency induced by similarity does appear to incorporate a dependence on usage. The frequency is calculated as the sum of the frequencies of the names (in the same block) that have similarity to the query name greater than or equal to the query similarity to the dictionary name being considered. This is based on an implicit ceteris paribus usage of similarity: If I were to accept this particular name as indicating an entity match, then logically I should also accept all other names that are at least as similar.

      • The name frequency induced by equality can be construed as compatible with this definition by interpreting equality as a degenerate similarity relation.

      • Is this a reasonable argument?

      • Should this implied comparison across names within block be incorporated somehow in the mathematical formulae?

        • This feels like it ought to be notated as a conditional probability (i.e. conditional on the dictionary record being accepted as a match) rather than an unconditional probability.

8.2 Blocking

  • It might be possible to develop (or at least explain) most of the maths in terms of blocking.

  • Start with a universe \(E\) of entities with the queries drawn uniformly at random from that universe and the dictionary being identical to the universe.

  • In the absence of any other information, the probability of each dictionary element being an identity match with the query is \(1 / |E|\).

  • If the dictionary \(D\) is a proper subset of \(E\) it can be thought of as a (not very informative) block. That is, a block can be thought of as the dictionary induced by the query record.

  • If \(D\) is a random subset of \(E\), and in the absence of any other information, the probability of each dictionary element being an identity match with the query is \((|D| / |E|) / |D| = 1 / |E|\).

  • \(|D|\) is the block size of the dictionary construed as a block. In the absence of any information to discriminate between records in the block, the larger the block, the lower the probability that any given block record is the identity match to the query record.

  • \(|D| / |E|\) is the probability that the entity corresponding to the query record is in the dictionary/block, given that the dictionary is a uniform random subset of the universe of entities.

  • Consider the other extreme, where blocking is perfect. That is, the entity query record is guaranteed to be in the dictionary. In this case the blocking is very informative (not a random selection from the universe of entities). The probability that the correctly matching entity record is in the block is 1, and the probability of each dictionary/block element being an identity match with the query is \(1 / |D|\).

    • That is, with perfect blocking, the probability of each entity being the identity match depends only on the block size.

    • Where blocking is based on, say, name, the block size is the frequency of the name in the dictionary.

    • Where blocking is less than perfect, the probability that the entity corresponding to the query record is contained in the block will be less than 1.

  • This suggests that the probability of each record in the block being the correct match could be calculated as the product of conditional probabilities:

    \[ P(id(q) = id(d_i)) = P(id(q) = id(d_i) | d_i \in B_j) P(d_i \in B_j | d_i \in D) P(d_i \in D) \]

  • What we need are models to estimate those component probabilities. I suspect that the cardinalities of those blocks would be very strong predictors of the probabilities. If the blocks are defined in terms of equality of names then the name frequencies determine the block cardinalities.

  • If there were multiple independent blockings, the probability of correct match of each record in the intersection of the blocks is the product of the probabilities associated with each block. This is equivalent to naive Bayes and shows the equivalence between construing the problem as a multivariate regression and construing it as using multiple blocking variables (made explicit in the factorisation below).

    • Using regression rather than naive Bayes allows compensating for the blocking variables not being independent.
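
To make the naive Bayes equivalence explicit, one way to write it (with \(M\) denoting the event \(id(q) = id(d_i)\), each \(B_k\) the event that \(d_i\) falls in the block induced by the \(k\)-th blocking variable, and the blockings assumed conditionally independent given match status): the posterior log-odds decompose additively, which is exactly the functional form of a logistic regression with one term per blocking variable and no interactions.

\[ \log \frac{P(M \mid B_1, \dots, B_K)}{P(\lnot M \mid B_1, \dots, B_K)} = \log \frac{P(M)}{P(\lnot M)} + \sum_{k=1}^{K} \log \frac{P(B_k \mid M)}{P(B_k \mid \lnot M)} \]

Fitting the per-block coefficients by regression, rather than implicitly fixing them at 1 as naive Bayes does, is what allows compensating for dependence between the blocking variables.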

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] DiagrammeR_1.0.6.1 here_1.0.1         workflowr_1.6.2   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         pillar_1.5.0       compiler_4.0.3     later_1.1.0.1     
 [5] RColorBrewer_1.1-2 git2r_0.28.0       tools_4.0.3        digest_0.6.27     
 [9] jsonlite_1.7.2     evaluate_0.14      lifecycle_1.0.0    tibble_3.1.0      
[13] pkgconfig_2.0.3    rlang_0.4.10       rstudioapi_0.13    yaml_2.2.1        
[17] xfun_0.21          stringr_1.4.0      knitr_1.31         fs_1.5.0          
[21] vctrs_0.3.6        htmlwidgets_1.5.3  rprojroot_2.0.2    glue_1.4.2        
[25] R6_2.5.0           fansi_0.4.2        rmarkdown_2.7      bookdown_0.21     
[29] magrittr_2.0.1     whisker_0.4        promises_1.2.0.1   ellipsis_0.3.1    
[33] htmltools_0.5.1.1  renv_0.13.0        httpuv_1.5.5       utf8_1.1.4        
[37] stringi_1.5.3      visNetwork_2.0.9   crayon_1.4.1