This document records points that may be useful for later project or manuscript development and that are either not covered in the analysis notebooks or at risk of getting lost in them.

1 Project infrastructure

  • Consider using the targets package to control the computational workflow.

2 Identity data

  • Get a sizeable publicly available data set with personal names (NCVR).

    • The focus of the empirical work is on string similarity metrics of names.
  • Use sex and age in addition to personal names so that most records are discriminable.

    • High frequency names will likely not be discriminable with only these attributes.

      • This is not a problem because we are really interested in whether the methods proposed here assist in quantifying the discriminability of records. We want records spanning a wide range of discriminability.
    • Age (and possibly sex) will be used as a blocking variable.

      • Blocking is probably needed to make the project computationally tractable.
    • Age and sex are also of interest in the calculation of name frequency because name distributions should vary conditional on age and sex.

  • Keep address and phone number as they may be useful for manually checking identity in otherwise nondiscriminable records.

    • As a fallback position, address and phone number can be used as discriminating attributes in the compatibility model.
  • Get the oldest available data to minimise its currency (NCVR 2005 snapshot).

  • Drop objectionable attributes such as race and political affiliation.

3 Identity data preparation

  • Apply basic data cleaning to the predictive attributes.

    • This is probably unnecessary given how the data will be used.

    • I can’t bring myself to model data without scrutinising it first.

  • Only keep records that are ACTIVE and VERIFIED for modelling.

    • These are likely to have the highest data quality attributes.

    • These are least likely to have duplicate records (i.e. referring to the same person).

4 Blocking

  • Use blocking to reduce the number of comparisons, to keep the computational cost feasible.

  • This project is not an operational system and we are only using blocking to reduce the computational cost, so we can choose blocking that would not be acceptable for an operational system.

  • Where the dictionary blocks vary widely in size we might choose to work with only a subset of blocks that are a suitable size.

  • If we think that some aspect of the compatibility modelling might vary as a function of block size we would probably want to test this over a wide range of block sizes.

  • We will probably be repeating analyses over some number of blocks to assess the variability of results, but that might only be a subset of blocks with no commitment to examine all the blocks.

  • Blocking variables may have missing values in the query and dictionary records.

    • This can be handled but is fiddly and not the focus of this project.
    • Handling missing predictor values in regression-based compatibility models is simple.
  • Exclude records with missing values on the blocking variables from use, so that missing values don’t have to be allowed for in blocking.

  • Try to choose blocking variables with a small proportion of missing values, so as to minimise systematic bias due to their exclusion.

  • Construct a few potentially useful blocking variables.

  • Blocking can induce changes in the distributions of names.

    • Blocking on sex (in combination with other variables) will definitely give more homogeneity of first names within blocks because of gendered names.

    • Blocking on age will give more homogeneity of first names within blocks because of name popularity varying over time.

    • Blocking on county may give more homogeneity of last names within blocks because of families living together.
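
The blocking rules above (group dictionary records by the blocking variables, and exclude records with missing blocking values) can be sketched as follows. This is only a minimal illustration: the field names (`sex`, `age`, and so on), the toy records, and the helper `build_blocks` are all hypothetical and are not taken from the NCVR data or the analysis notebooks.

```python
from collections import defaultdict


def build_blocks(records, key_fields):
    """Group records by their blocking-key values.

    Records with a missing value (None or "") on any blocking field are
    set aside rather than assigned to a block.
    """
    blocks = defaultdict(list)
    excluded = []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        if any(v is None or v == "" for v in key):
            excluded.append(rec)
        else:
            blocks[key].append(rec)
    return dict(blocks), excluded


# Toy dictionary records (hypothetical field names and values).
dictionary = [
    {"id": 1, "first": "ANN", "last": "SMITH", "sex": "F", "age": 34},
    {"id": 2, "first": "ANNE", "last": "SMYTH", "sex": "F", "age": 34},
    {"id": 3, "first": "BOB", "last": "JONES", "sex": "M", "age": 34},
    {"id": 4, "first": "CAL", "last": "JONES", "sex": None, "age": 51},
]

blocks, excluded = build_blocks(dictionary, ["sex", "age"])

# Block sizes are of interest because they vary widely in the real data.
block_sizes = {key: len(recs) for key, recs in blocks.items()}

# Proportion of records lost to missing blocking values.
missing_fraction = len(excluded) / len(dictionary)
```

Tracking `missing_fraction` for each candidate blocking variable is one way to compare them on the "small proportion of missing values" criterion.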

5 Identity data characterisation

These sections are about looking at the properties of the identity data that will be most relevant to their use in the compatibility models.

5.1 Structure induced by equality

  • Prior work on name frequency has (implicitly) only considered equality of names. That is, two name tokens are either absolutely identical or absolutely different. (No gradations of difference are entertained.) Counting the frequency of names is finding the cardinality of each set of identical name tokens.

    • Name frequency will be used as a predictor in a compatibility model, \(compat(q, d_i)\).
    • For a query record \(q\) the compatibility will be estimated for every dictionary record \(d_i\) in the block \(B_q\) (the set of dictionary records in the block selected by the query).
    • The estimated compatibility will vary over \(d_i\), so we expect the predictors to be functions of \(d_i\) (in the context of the set of dictionary records \(B_q\) selected by the query record).
    • The (first attempt at) name equality frequency \(f_{eq}(q, d_i)\) is defined as \(\vert \{ d_j : d_j \in B_q \land name(d_j) = name(d_i)\} \vert\).
  • Look at frequency distributions of names conditional on name length. The Zipf distributions may have different shape parameters for different name lengths. Name length might be examined as an alternative to name frequency for interaction with similarity.

  • Look at frequency distributions of names conditional on age and/or sex.

    • These conditional distributions may increase the predictive power of the compatibility model.
  • Look at frequency distributions of names conditional on blocking variables.

    • This is to get an understanding of the effect of blocking on name frequency distributions.
    • In particular, look at any effects of block size (which has a very wide range).
    • The anticipated usage is that name frequency will be calculated within the dictionary block selected by the query record. (The block can be construed as the only dictionary that matters for the purposes of the query.)
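
As a minimal sketch of the (first attempt at) equality frequency \(f_{eq}(q, d_i)\) defined above, the following counts, for each record in a block, how many records in the same block carry an identical name. The function name `equality_frequency` and the toy names are assumptions made for illustration. Note that, under this definition, the query only enters through the choice of block \(B_q\).

```python
from collections import Counter


def equality_frequency(block_names):
    """Return f_eq for every record in one block.

    block_names holds name(d_j) for each dictionary record d_j in B_q.
    The result has one entry per record: the number of records in the
    block whose name is identical to that record's name (itself included).
    """
    counts = Counter(block_names)
    return [counts[name] for name in block_names]


# Toy block of last names (hypothetical values).
block_names = ["SMITH", "SMITH", "SMYTH", "JONES", "SMITH"]
f_eq = equality_frequency(block_names)  # [3, 3, 1, 1, 3]
```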

5.2 Structure induced by similarity

  • The similarity version of name frequency is an extension of the equality version. It counts the number of dictionary records in the block that are at least as similar to the query record as the currently considered dictionary record.

    • The (first attempt at) name similarity frequency \(f_{sim}(q, d_i)\) is defined as \(\vert \{ d_j : d_j \in B_q \land sim(name(q), name(d_j)) \ge sim(name(q), name(d_i))\} \vert\).
  • Look at similarity frequency distributions of names. It’s not obvious that these should be Zipf distributions. For example, the rare names might be quite similar to more frequent names, which might obscure the long tail of the underlying distribution of names.

  • Look at similarity frequency distributions of names conditional on name length. The distributions (Zipf or otherwise) may have different shape parameters for different name lengths.

    • This is of interest because similarity is usually scaled to be between 1 (equality) and 0 (completely different) regardless of the string length. The longer the strings the greater the size of the space of possible strings and the higher the dimensionality of the space. It’s not obvious to me that equality (inequality) of very short strings carries the same evidential value as equality (inequality) of long strings.
  • Look at similarity frequency distributions of names conditional on age and/or sex.

    • These conditional distributions may increase the predictive power of the compatibility model.
  • Look at similarity frequency distributions of names conditional on blocking variables.

    • This is to get an understanding of the effect of blocking on similarity frequency distributions.
    • In particular, look at any effects of block size (which has a very wide range).
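
The (first attempt at) similarity frequency \(f_{sim}(q, d_i)\) can be sketched in the same way. The similarity function used here, `difflib.SequenceMatcher.ratio` from the Python standard library, is only a stand-in chosen so that the example is self-contained; the project's actual string similarity metric is not specified in these notes, and the function names and toy names are hypothetical.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Stand-in similarity in [0, 1]; 1 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_frequency(query_name, block_names, sim=sim):
    """Return f_sim for every record in one block.

    For each dictionary record d_i, count the records d_j in the block
    with sim(name(q), name(d_j)) >= sim(name(q), name(d_i)).
    """
    scores = [sim(query_name, name) for name in block_names]
    return [sum(1 for s in scores if s >= s_i) for s_i in scores]


# Toy block of last names (hypothetical values).
block_names = ["SMITH", "SMYTH", "JONES", "SMITH"]
f_sim = similarity_frequency("SMITH", block_names)
```

Because every record is at least as similar to the query as itself, each value is at least 1, and the least similar record in the block always receives the block size. This quadratic-time version is adequate for small blocks; sorting the scores once would be the obvious refinement for large blocks.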

6 Modelling

  • Try indicators for missingness. Missingness may be differentially informative across different predictor variables.

  • Try indicators for similarity == 1. The compatibility of exact string equality is not necessarily continuous with the compatibility of similarity just below 1.

  • Try name frequency as an interactive predictor variable.

    • Also consider frequency conditional on age and/or sex.
  • There are two names in each lookup: dictionary and query. Therefore there are also two name frequencies to be considered. Consider how to use both frequencies (e.g. min, max, geometric mean, …).

    • Queries may contain names that do not exist in the dictionary, so we need to deal with that case.

    • Do we need to apply frequency smoothing, as used in probabilistic linguistic models?

    • Do we need to estimate the probability mass of unobserved names?

  • In general, the dictionary will be a subset of the entities in the universe of queries. Consider the impact of this on modelling as the fraction of the query universe in the dictionary varies.

    • Consider whether this varies on a per query basis because of blocking. That is, is there effectively a separate dictionary per blocking value?

    • Can the sampling fraction be modelled as a prior probability (effectively, a change in the intercept term of a logistic regression)?

    • Consider using the same variable for blocking and as a predictor to compare the effect on estimated probability of identity match as a function of dictionary fraction.

      • It is feasible to use the “same” variables for blocking and as predictors. The blocks are based on the values of the dictionary records and selected by the value in the query record. The predictor variables are properties of record pairs - so there will still be within-block variance of the predictor even when there is no within-block variance of properties of the query record. That is, they’re not the “same” variable when they are used for blocking and as a predictor.
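
One possible way to handle the two name frequencies per lookup, including query names that do not occur in the dictionary, is sketched below. It uses additive (Laplace-style) smoothing with a single extra cell standing in for all unseen names. This is an assumption made for illustration, not a decision recorded in these notes; the function names, the smoothing constant `alpha`, and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def smoothed_relative_frequency(name, counts, alpha=1.0):
    """Additively smoothed relative frequency of a name.

    One extra cell is reserved for all unseen names taken together, so a
    name absent from the dictionary still gets a non-zero frequency.
    """
    total = sum(counts.values())
    n_cells = len(counts) + 1  # observed names plus one "unseen" cell
    return (counts.get(name, 0) + alpha) / (total + alpha * n_cells)


def combined_frequencies(query_name, dict_name, counts, alpha=1.0):
    """Candidate summaries of the query and dictionary name frequencies."""
    fq = smoothed_relative_frequency(query_name, counts, alpha)
    fd = smoothed_relative_frequency(dict_name, counts, alpha)
    return {"min": min(fq, fd), "max": max(fq, fd), "geo_mean": sqrt(fq * fd)}


# Toy dictionary block; the query name "ZYX" does not occur in it.
counts = Counter(["SMITH", "SMITH", "SMITH", "JONES"])
features = combined_frequencies("ZYX", "SMITH", counts)
```

Whether smoothing of this kind is needed at all, and whether the mass assigned to unseen names should be estimated more carefully, remain the open questions listed above.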

7 Performance evaluation

  • Partition the records into a dictionary and a set of queries (the \(Q_U\) set).

    • By the (reasonable) assumption that there are no duplicate records, each of the \(Q_U\) queries will be unmatched in the dictionary.
  • Select a subset of the dictionary records to use as the \(Q_M\) query set.

    • By the (reasonable) assumption that there are no duplicate records, each of the \(Q_M\) queries will have exactly one matching record in the dictionary.
  • This evaluation differs from the usual evaluation of entity resolution in that it doesn’t consider the impact of transcription/typographical variation in the queries.

    • It looks at the quantification of discriminability when the available attributes are not necessarily able to ensure that all records are discriminable.
  • If we are interested in the performance with respect to transcription/typographical variation we may need to consider artificially corrupting some of the queries.

  • Consider assigning some of the dictionary records to randomly chosen wrong blocks. (Is this equivalent to randomly selecting some out-of-block query records to run against each dictionary block?)
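
The partition into a dictionary, a \(Q_U\) query set and a \(Q_M\) query set might be constructed along these lines. It is a sketch only: the set sizes, the seed and the function name `make_evaluation_sets` are placeholders, and it relies on the assumption stated above that the records contain no duplicates.

```python
import random


def make_evaluation_sets(records, n_unmatched, n_matched, seed=0):
    """Split records into a dictionary, Q_U and Q_M.

    Q_U records are held out of the dictionary, so (assuming no duplicate
    records) they have no match there. Q_M records are sampled from the
    dictionary, so each has exactly one match.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    q_u = shuffled[:n_unmatched]
    dictionary = shuffled[n_unmatched:]
    q_m = rng.sample(dictionary, n_matched)
    return dictionary, q_u, q_m


# Toy records identified only by an id (hypothetical).
records = [{"id": i} for i in range(100)]
dictionary, q_u, q_m = make_evaluation_sets(records, n_unmatched=20, n_matched=10)
```

Artificial corruption of query names, if it is wanted, could be applied to copies of the \(Q_M\) records after this step.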

8 Writing/theory

  • Note relationship to Fellegi & Sunter / probabilistic record linkage (in proposal?)

8.1 Blocking

  • It might be possible to develop (or at least explain) most of the maths in terms of blocking.

  • Start with a universe \(E\) of entities with the queries drawn uniformly at random from that universe and the dictionary being identical to the universe.

  • In the absence of any other information, the probability of each dictionary element being an identity match with the query is \(1 / |E|\).

  • If the dictionary \(D\) is a proper subset of \(E\) it can be thought of as a (not very informative) block. That is, a block can be thought of as the dictionary induced by the query record.

  • If \(D\) is a random subset of \(E\), and in the absence of any other information, the probability of each dictionary element being an identity match with the query is \((|D| / |E|) / |D| = 1 / |E|\).

  • \(|D|\) is the block size of the dictionary construed as a block. In the absence of any information to discriminate between records in the block, the larger the block, the lower the probability that any given block record is the identity match to the query record.

  • \(|D| / |E|\) is the probability that the entity corresponding to the query record is in the dictionary/block, given that the dictionary is a uniform random subset of the universe of entities.

  • Consider the other extreme, where blocking is perfect. That is, the entity corresponding to the query record is guaranteed to be in the dictionary/block. In this case the blocking is very informative (not a random selection from the universe of entities). The probability that the correctly matching entity record is in the block is 1, and the probability of each dictionary/block element being an identity match with the query is \(1 / |D|\).

    • That is, with perfect blocking, the probability of each entity being the identity match depends only on the block size.

    • Where blocking is based on, say, name, the block size is the frequency of the name in the dictionary.

    • Where blocking is less than perfect the probability that the entity corresponding to the query record is contained in the block will be less than 1.

  • This suggests that the probability of each record in the block being the correct match could be calculated as the product of conditional probabilities:

    \[ P(id(q) = id(d_i)) = P(id(q) = id(d_i) | d_i \in B_j) P(d_i \in B_j | d_i \in D) P(d_i \in D) \]

  • What we need are models to estimate those component probabilities. I suspect that the cardinalities of those blocks would be very strong predictors of the probabilities. If the blocks are defined in terms of equality of names then the name frequencies determine the block cardinalities.

  • If there were multiple independent blockings, the probability of correct match of each record in the intersection of the blocks is the product of the probabilities associated with each block. This is equivalent to naive Bayes and shows the equivalence between construing the problem as a multivariate regression and construing it as using multiple blocking variables.

    • Using regression rather than naive Bayes allows compensating for the blocking variables not being independent.
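
The arithmetic in the single-block cases above can be checked with exact fractions. The sketch below assumes, as the notes do, that nothing discriminates between records within a block, so the per-record probability is the probability that the query's entity is in the block divided by the block size. The function names and the numbers are illustrative only.

```python
from fractions import Fraction


def p_match_random_subset(universe_size, dict_size):
    """Per-record match probability when D is a uniform random subset of E."""
    p_in_dictionary = Fraction(dict_size, universe_size)  # |D| / |E|
    p_given_in_dictionary = Fraction(1, dict_size)  # 1 / |D|
    return p_in_dictionary * p_given_in_dictionary


def p_match_with_block(p_entity_in_block, block_size):
    """Per-record match probability for a block known to contain the
    query's entity with probability p_entity_in_block."""
    return Fraction(p_entity_in_block) * Fraction(1, block_size)


# Random dictionary: the |D| terms cancel, leaving 1 / |E|.
p_random = p_match_random_subset(universe_size=1000, dict_size=200)

# Perfect blocking: the probability depends only on the block size.
p_perfect = p_match_with_block(1, block_size=5)

# Imperfect blocking: the entity is in the block with probability 9/10.
p_imperfect = p_match_with_block(Fraction(9, 10), block_size=5)
```

With these toy numbers the three probabilities are 1/1000, 1/5 and 9/50, which shows how much of the information is carried by block membership and block size alone.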