About

Last updated: 2021-01-04

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2).

R Markdown file: up-to-date

Repository version: c742979

The results in this page were generated with repository version c742979. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/d.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/about.Rmd) and HTML (docs/about.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	856a513	Ross Gayler	2021-01-04	Build site.
html	838463a	Ross Gayler	2020-12-23	Build site.
html	a618d9e	Ross Gayler	2020-12-23	Build site.
html	36ccc82	Ross Gayler	2020-12-13	Build site.
html	d5eb60b	Ross Gayler	2020-12-10	Build site.
html	01b669c	Ross Gayler	2020-12-10	Build site.
html	1993afa	Ross Gayler	2020-12-10	Build site.
html	73bdc5e	Ross Gayler	2020-12-10	Build site.
html	bc8c1cc	Ross Gayler	2020-12-06	Build site.
html	2f9886a	Ross Gayler	2020-12-05	Build site.
html	5f37c79	Ross Gayler	2020-11-30	Build site.
Rmd	680e328	Ross Gayler	2020-11-30	Complete the project structure
html	c2e37f3	Ross Gayler	2020-11-30	Initial ndex.Rmd
html	2a722d0	Ross Gayler	2020-11-29	end of day
html	03b0a02	Ross Gayler	2020-11-04	Build site.
Rmd	e163b3b	Ross Gayler	2020-11-04	Start workflowr project.

This is an open, shareable, reproducible, computational research project on entity resolution.

It is the joint work of:

Given a collection of records which are each “about” one entity, entity resolution is the process of determining which records probably refer to the same entity. It is used in contexts where there is no uniquely identifying entity key on the records, so the process is forced to rely on record attributes that are associated with identity, but not uniquely determined by identity (e.g. height, weight, and eye colour as attributes of persons).

The inference that two records refer to the same entity is inherently probabilistic, because multiple distinct entities may have identical values on the available record attributes and are then indistinguishable on the basis of those attributes. So, given a pair of records, we are interested in the probability that they refer to the same entity.

Entity resolution is typically conceptualised in terms of the similarity between records, and similarity is assumed to be monotonically related to the probability that the records refer to the same entity. This project investigates the value of empirically determining the relationship between similarity and probability of co-reference, which can be seen as an example of calibration.
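As a rough illustration of what such a calibration could look like, the sketch below estimates the probability of co-reference as the observed match rate within equal-width similarity bins. It is not the method used in this project; the function name `binned_calibration`, the choice of binning, and the labelled pairs are all assumptions made up for the example.

```python
from collections import defaultdict

def binned_calibration(similarities, is_match, n_bins=5):
    """Estimate P(match | similarity) by the match rate within equal-width bins.

    similarities: floats in [0, 1] (hypothetical similarity scores)
    is_match: booleans, True when the pair is known to co-refer
    Returns a dict mapping bin index -> (pair count, empirical match rate).
    """
    counts = defaultdict(int)
    matches = defaultdict(int)
    for s, m in zip(similarities, is_match):
        # Clamp so that a similarity of exactly 1.0 falls in the top bin.
        b = min(int(s * n_bins), n_bins - 1)
        counts[b] += 1
        matches[b] += int(m)
    return {b: (counts[b], matches[b] / counts[b]) for b in sorted(counts)}

# Made-up labelled record pairs, for illustration only.
sims = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.90, 0.95, 1.00]
labels = [False, False, False, False, True, False, True, True, True, True]
calibration = binned_calibration(sims, labels, n_bins=5)

for b, (n, rate) in calibration.items():
    print(f"bin {b}: {n} pairs, empirical match rate {rate:.2f}")
```

With so few pairs the estimated rates need not increase from bin to bin, which is one reason an empirical check of the assumed monotonic relationship can be informative.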

We also investigate whether that calibration varies as a function of other measurable properties of the specific records being compared. For example, we could look at how frequently the attribute values being compared occur in the collection, and see whether that information can be exploited to yield better entity resolution.
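To make the frequency idea concrete, here is a minimal sketch of one way to compute such a quantity. The helper names `value_frequencies` and `pair_frequency`, the rule of taking the rarer value's frequency, and the surnames are assumptions for illustration, not definitions taken from this project.

```python
from collections import Counter

def value_frequencies(values):
    """Return the relative frequency of each attribute value in the collection."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def pair_frequency(value_a, value_b, freqs):
    """Hypothetical pair-level predictor: frequency of the rarer of the two values."""
    return min(freqs.get(value_a, 0.0), freqs.get(value_b, 0.0))

# Made-up surname column, for illustration only.
surnames = ["SMITH", "SMITH", "SMITH", "JONES", "JONES", "ZABROWSKI", "SMITH", "JONES"]
freqs = value_frequencies(surnames)

print(freqs["SMITH"])                               # share of records with a common surname
print(pair_frequency("SMITH", "ZABROWSKI", freqs))  # driven by the rarer surname
```

The intuition is that agreement on a rare value is stronger evidence of co-reference than agreement on a common one, so a quantity like this could plausibly shift the calibration.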

Entity resolution typically uses a small number of fixed similarity functions (e.g. edit distance between strings) that are defined without reference to the specific pair of records being compared. Incorporation of other predictors, which are functions of the specific records being compared, into the calibration function can be seen as similar in spirit to having a customised similarity function for every pair of records. This parallels the practice of using subpopulation-specific model calibration functions to better combine model estimates across multiple subpopulations.
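The following sketch shows a fixed similarity function of the kind described above: a normalised edit (Levenshtein) distance. The normalisation by the longer string's length and the name `edit_similarity` are assumptions chosen for the example; other normalisations are equally possible.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

print(edit_distance("kitten", "sitting"))
print(edit_similarity("SMITH", "SMYTH"))
```

Note that the result depends only on the two strings: it is the same whether the values are common or rare in the collection. Supplying record-specific predictors to the calibration function is one way to recover that missing context.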