Last updated: 2021-01-04


Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.



The results in this page were generated with repository version c742979. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/d.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/about.Rmd) and HTML (docs/about.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author       Date        Message
html  856a513  Ross Gayler  2021-01-04  Build site.
html  838463a  Ross Gayler  2020-12-23  Build site.
html  a618d9e  Ross Gayler  2020-12-23  Build site.
html  36ccc82  Ross Gayler  2020-12-13  Build site.
html  d5eb60b  Ross Gayler  2020-12-10  Build site.
html  01b669c  Ross Gayler  2020-12-10  Build site.
html  1993afa  Ross Gayler  2020-12-10  Build site.
html  73bdc5e  Ross Gayler  2020-12-10  Build site.
html  bc8c1cc  Ross Gayler  2020-12-06  Build site.
html  2f9886a  Ross Gayler  2020-12-05  Build site.
html  5f37c79  Ross Gayler  2020-11-30  Build site.
Rmd   680e328  Ross Gayler  2020-11-30  Complete the project structure
html  c2e37f3  Ross Gayler  2020-11-30  Initial ndex.Rmd
html  2a722d0  Ross Gayler  2020-11-29  end of day
html  03b0a02  Ross Gayler  2020-11-04  Build site.
Rmd   e163b3b  Ross Gayler  2020-11-04  Start workflowr project.

This is an open, shareable, reproducible, computational research project on entity resolution.

It is the joint work of:

Given a collection of records which are each “about” one entity, entity resolution is the process of determining which records probably refer to the same entity. It is used in contexts where there is no uniquely identifying entity key on the records, so the process is forced to rely on record attributes that are associated with identity, but not uniquely determined by identity (e.g. height, weight, and eye colour as attributes of persons).

Inferring that two records refer to the same entity is inherently probabilistic, because it is always possible for multiple entities to have identical values on the available record attributes and therefore to be indistinguishable from one another in the data. So, given a pair of records, we are interested in the probability that they refer to the same entity.

Entity resolution is typically conceptualised in terms of the similarity between records, and similarity is assumed to be monotonically related to the probability that the records refer to the same entity. This project investigates the value of empirically determining the relationship between similarity and probability of co-reference. Determining that relationship precisely can be seen as an example of calibration.
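As a rough illustration of what calibration means here, the following Python sketch bins record pairs by similarity and computes the observed proportion of true matches in each bin. It is a minimal sketch under stated assumptions, not the method used in this project: the function name `calibrate_by_bins`, the choice of equal-width bins, and the toy labelled pairs are all made up for illustration.

```python
from collections import defaultdict


def calibrate_by_bins(pairs, n_bins=5):
    """Estimate P(same entity | similarity bin) from labelled record pairs.

    pairs: iterable of (similarity in [0, 1], is_match as bool).
    Returns a dict mapping bin index -> observed proportion of matches.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [n_matches, n_pairs]
    for similarity, is_match in pairs:
        # Clamp so that similarity == 1.0 falls into the top bin.
        b = min(int(similarity * n_bins), n_bins - 1)
        counts[b][0] += int(is_match)
        counts[b][1] += 1
    return {b: m / n for b, (m, n) in sorted(counts.items())}


# Hypothetical labelled pairs, invented purely for illustration.
toy_pairs = [
    (0.05, False), (0.15, False), (0.30, False), (0.35, False),
    (0.50, False), (0.55, True), (0.70, True), (0.75, False),
    (0.90, True), (1.00, True),
]
calibration = calibrate_by_bins(toy_pairs, n_bins=5)
for bin_index, proportion in calibration.items():
    print(bin_index, proportion)
```

The resulting table of proportions is an empirical estimate of the similarity-to-probability mapping. If similarity really is monotonically related to the probability of co-reference, the proportions should not decrease from one bin to the next, apart from sampling noise.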

We also investigate whether that calibration varies as a function of other measurable quantities of the specific records being compared. For example, we could look at the frequency in the collection of the record attribute values being compared, and see whether that information can be exploited to yield better entity resolution.
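The frequency idea can be sketched in Python as follows. This is only an illustration: the record layout, the field name `surname`, the helper names, and the choice of taking the rarer of the two values as the pair-level predictor are all assumptions, not details taken from the project.

```python
from collections import Counter


def value_frequencies(records, field):
    """Relative frequency of each value of `field` in the collection."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def pair_frequency(rec_a, rec_b, field, freq):
    """Pair-level predictor: frequency of the rarer of the two values."""
    return min(freq[rec_a[field]], freq[rec_b[field]])


# Hypothetical collection, invented purely for illustration.
records = [
    {"id": 1, "surname": "SMITH"},
    {"id": 2, "surname": "SMITH"},
    {"id": 3, "surname": "SMITH"},
    {"id": 4, "surname": "ZABRISKIE"},
]
freq = value_frequencies(records, "surname")
print(freq["SMITH"], freq["ZABRISKIE"])
print(pair_frequency(records[0], records[3], "surname", freq))
```

One plausible motivation, stated here as an assumption rather than a finding, is that agreement on a rare value is stronger evidence of co-reference than agreement on a common one. A predictor like `pair_frequency` could then be supplied to the calibration alongside the similarity score.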

Entity resolution typically uses a small number of fixed similarity functions (e.g. edit distance between strings) that are defined without reference to the specific pair of records being compared. Incorporation of other predictors, which are functions of the specific records being compared, into the calibration function can be seen as similar in spirit to having a customised similarity function for every pair of records. This parallels the practice of using subpopulation-specific model calibration functions to better combine model estimates across multiple subpopulations.
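For concreteness, here is a minimal Python sketch of a fixed similarity function of the kind mentioned above: Levenshtein edit distance, rescaled to a similarity in [0, 1]. The function names and the normalisation by the longer string's length are illustrative assumptions, not necessarily what the project uses.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Rescale edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))
print(edit_similarity("SMITH", "SMYTH"))
```

Note that `edit_similarity` depends only on the two strings it is given. It assigns the same score to a given pair of values regardless of how common those values are in the collection, which is the limitation that record-specific predictors in the calibration function are meant to address.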