Last updated: 2021-01-15

Checks: 7 passed, 0 failed

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version a9a7cf4. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  analysis/02-1_block_vars.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01-6_clean_vars.Rmd) and HTML (docs/01-6_clean_vars.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | a9a7cf4 | Ross Gayler | 2021-01-15 | Fix typo - birt_place |
| html | 7adf4ed | Ross Gayler | 2021-01-15 | Build site. |
| Rmd | c674a51 | Ross Gayler | 2021-01-15 | Add 01-6 clean vars |

# Set up the project environment, because each Rmd file knits in a new R session
# so doesn't get the project setup from .Rprofile

# Project setup
library(here)
source(here::here("code", "setup_project.R"))
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.0.4     ✓ dplyr   1.0.2
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
# Extra set up for the 01*.Rmd notebooks
source(here::here("code", "setup_01.R"))

Attaching package: 'glue'
The following object is masked from 'package:dplyr':

    collapse
# Extra set up for this notebook
# ???

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

1 Introduction

The 01*.Rmd notebooks read the data, filter it to the subset to be used for modelling, characterise it to understand it, check for possible gotchas, clean it, and save it for the analyses proper.

This notebook (01-6_clean_vars) prepares all the variables for use in predictive modelling and saves the data.

1.1 Variable roles

The name variables, last_name, first_name, and midl_name, will definitely be used in compatibility modelling.

We intend to use the one snapshot file as both the database to be queried and as the set of queries. Consequently, strictly speaking, we don’t need to standardise the name variables because the database and query records are guaranteed to be identical (they will literally be the same record). However, we will look at the name variables with an eye to standardisation because it is never a good idea to statistically model data without having an idea about the quality of the data. We will apply some basic standardisation to the name variables, if appropriate, because it parallels what would be necessary in practice.

The demographic variables sex, age, birth_place, and the administrative variable county_id may be used as predictors and/or blocking variables.

The remainder of the variables (residence and administrative variables) will be kept in case they are useful for manually assessing claimed matches.

1.2 Cleanup for predictors

1.2.1 Name standardisation

Standardisation will be applied to the name variables last_name, first_name, and midl_name. This attempts to remove variation that is probably irrelevant to identity (e.g. case, punctuation, and spacing).

1.2.2 Missing values

In the previous notebooks I have converted empty strings to missing values (NA_character_ in R). This was convenient because table() and skim() count missing values as a separate category. However, modelling is a different kettle of fish.

In modelling, we want to get an estimated probability of identity match for every query, regardless of how many attributes have missing values. Typical modelling functions do not tolerate any missing (NA) values in predictors. If any of the predictors is missing then the estimate is also missing.

We avoid that problem by transforming the missing values into some nonmissing value and creating an extra variable to indicate the missingness. This will be done for the name variables last_name, first_name, and midl_name, and the demographic variable age. (This is not necessary for birth_place because “missing” is just another valid level of the variable.)
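As a minimal illustration of why the indicator approach helps, the sketch below uses made-up toy data and base R lm(), which is an assumption for illustration only and not a model used in this project. It shows the prediction going missing when a predictor is NA, and every row getting a prediction once the missing value is recoded and flagged. The data frame `toy` and all of its values are hypothetical.

```r
# Toy data (hypothetical values, not from the voter file)
toy <- data.frame(
  y   = c(1.0, 2.1, 2.9, 4.2, 5.1, 3.0),
  age = c(20, 30, 40, 50, 60, NA)
)

# Model on the raw predictor: the row with NA age gets an NA prediction
fit_raw  <- lm(y ~ age, data = toy)
pred_raw <- predict(fit_raw, newdata = toy)

# Recode NA to a nonmissing value (0) and flag it with an indicator
toy$age_miss <- is.na(toy$age)
toy$age_cln  <- ifelse(toy$age_miss, 0, toy$age)

# Model on the recoded predictor plus indicator: every row gets a prediction
fit_cln  <- lm(y ~ age_cln + age_miss, data = toy)
pred_cln <- predict(fit_cln, newdata = toy)

print(is.na(pred_raw))  # which raw predictions are missing
print(anyNA(pred_cln))  # whether any recoded predictions are missing
```

The indicator lets the model estimate a separate effect for "age unknown", so the arbitrary fill value (0 here) does not distort the slope estimated from the known ages.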

1.2.3 Cleanup summary

The cleanup actions to be applied are (in order):

  • age

    • convert from string to integer
    • add missing value indicator and set to true if age < 17 or age > 104
    • if age missing indicator is true, set age to 0
  • preprocess all character variables

    • map missing to empty string
    • map lower case letters to upper case
  • all names

    • map each non-alphanumeric character to a space (Remove variability of punctuation while preserving word boundaries.)
    • map words 11, 111, 1111 to words II, III, IIII (Correct substitution of 1 for I in generation suffixes.)
    • if name contains zero and no other digits, map zero to O (Correct substitution of 0 for O in names.)
    • map each digit to an empty string (Remove random digit insertions)
  • last name

    • map words DR, II, III, IIII, IV, JR, MD, SR to empty string
    • if number of letters in last name = 1, map name to empty string
  • middle name

    • map words AKA, DR, II, III, IV, JR, MD, MISS, MR, MRS, MS, NMN, NN, REV, SR to empty string
  • first name

    • map words DR, FATHER, III, IV, JR, MD, MISS, MR, MRS, NMN, REV, SISTER, SR to empty string
    • if number of letters in first name = 0, move first word of middle name to first name
  • postprocess all name variables

    • map all spaces to empty strings (Remove variability of spacing.)
    • add missing value indicator variables for all name variables

2 Read data

Read the usable data. Remember that this consists of only the ACTIVE & VERIFIED records.

# Show the entity data file location
# This is set in code/file_paths.R
fs::path_file(f_entity_raw_fst)
[1] "ent_raw.fst"
# get entity data
d <- fst::read_fst(f_entity_raw_fst) %>% 
  tibble::as_tibble() %>% 
  dplyr::select(-county_desc, -voter_reg_num, -sex_code) # drop redundant vars

dim(d)
[1] 4099699      23

3 Apply cleanup

3.0.1 Age

  • convert from string to integer
  • add missing indicator and set to true if age < 17 or age > 104
  • if age missing indicator is true, set age to 0
d <- d %>% 
  dplyr::mutate(
    age_cln = as.integer(age),
    age_cln_miss = ! dplyr::between(age_cln, 17, 104), # valid age range
    age_cln = dplyr::if_else(age_cln_miss, 0L, age_cln)
  )
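
The age rule can be checked on a few edge cases. This is a base R sketch rather than the dplyr code above, and the input vector is made up. The values 0, 105, and 204 do occur in the examples later in this notebook. The NA input is an assumption added to show one way of treating an unparseable age as missing. The code above does not do this explicitly, and the skim summary later in this notebook reports no missing values in age_cln or age_cln_miss for this data.

```r
# Hypothetical raw ages as strings, including boundary values and one NA
age_chr <- c("46", "16", "17", "104", "105", "204", "0", NA)

age_int <- as.integer(age_chr)

# Flag as missing if unparseable (assumption) or outside the valid range 17..104
age_miss <- is.na(age_int) | age_int < 17L | age_int > 104L

# Replace flagged ages with 0 so the cleaned variable has no NA values
age_cln <- ifelse(age_miss, 0L, age_int)

print(data.frame(age_chr, age_cln, age_miss))
```

The boundary ages 17 and 104 are kept, while 16 and 105 are flagged, which matches the "age < 17 or age > 104" rule stated in the bullets.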

3.0.2 Preprocess all character variables

  • map missing to empty string
  • map lower case letters to upper case

Note that this is an in-place transformation of the character variables rather than adding the transformed values as new variables. This is because I am viewing this as light tidying rather than as creating distinctly new cleaned values.

tidy_char_var <- function(x) {
  x %>% 
    tidyr::replace_na("") %>% # map NA to ""
    stringr::str_to_upper() # map lower case to upper case
}

d <- d %>% 
  dplyr::mutate(
    across(where(is.character), tidy_char_var) # apply to all char vars
  )

3.0.3 All names

  • map each non-alphanumeric character to a space (Remove variability of punctuation. This preserves word boundaries.)
  • map words 11, 111, 1111 to words II, III, IIII (Correct substitution of 1 for I in generation suffixes.)
  • if name contains zero and no other digits, map zero to O (Correct substitution of 0 for O in names.)
  • map each digit to an empty string (Remove random digit insertions)
# map zero to O if there are no other digits in the string
map_0_to_O <- function(x) { # x: vector of strings
  dplyr::if_else(
    stringr::str_detect( x, "0") & # if string contains zero AND
      stringr::str_detect( x, "[1-9]", negate = TRUE), # string contains no other digits
    stringr::str_replace_all( x, "0", "O"), # then map zero to O
    x # else return x
  )
}

# apply all-name cleaning
clean_name_var <- function(x) { # x: vector of strings
  x %>% 
    stringr::str_replace_all("[^ A-Z0-9]", " ") %>% # map non-alphanumeric to " "
    stringr::str_replace_all( # fix generation suffixes
      c("\\b11\\b"   = "II", 
        "\\b111\\b"  = "III", 
        "\\b1111\\b" = "IIII")
    ) %>% 
    map_0_to_O() %>% 
    stringr::str_remove_all("[0-9]") %>% # map remaining digits to ""
    stringr::str_squish() # remove excess whitespace
}

d <- d %>%
  dplyr::mutate(
    across(
      .cols = c(last_name, first_name, midl_name), # apply to all name vars
      .fns = clean_name_var, 
      .names = "{.col}_cln")
  )
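
To see what each step does, the following sketch re-expresses the same sequence of steps in base R (gsub() with perl = TRUE as a stand-in for the stringr calls above) and applies it to a handful of made-up names. The function name clean_name_sketch and all of the inputs are hypothetical, chosen only to exercise each rule.

```r
# Base R stand-in for clean_name_var(); assumes input is already upper case
clean_name_sketch <- function(x) {
  x <- gsub("[^ A-Z0-9]", " ", x, perl = TRUE)    # non-alphanumeric -> space
  x <- gsub("\\b11\\b", "II", x, perl = TRUE)     # fix generation suffixes
  x <- gsub("\\b111\\b", "III", x, perl = TRUE)
  x <- gsub("\\b1111\\b", "IIII", x, perl = TRUE)
  # map 0 to O only where the string has a zero and no other digits
  only_zero <- grepl("0", x) & !grepl("[1-9]", x)
  x[only_zero] <- gsub("0", "O", x[only_zero])
  x <- gsub("[0-9]", "", x)                       # drop any remaining digits
  trimws(gsub(" +", " ", x))                      # squash excess whitespace
}

# Made-up inputs, one per rule
raw <- c("O'NEIL", "SMITH-JONES", "J0NES", "HENRY 111", "SM1TH", "R0B3RT")
out <- clean_name_sketch(raw)
print(data.frame(raw, out))
```

Note that the zero rule is deliberately conservative: in "R0B3RT" the zero is not mapped to O because another digit is present, so both digits are simply removed.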

3.0.4 Last name

  • map words DR, II, III, IIII, IV, JR, MD, SR to empty string
  • if number of letters in last name = 1, map name to empty string
# remove words (w) from vector of strings (x)
remove_words <- function(x, w) { # x, w: vectors of char (w = words to remove)
  x %>% 
    stringr::str_remove_all(
      pattern =  paste0("\\b", w, "\\b", collapse = "|") #convert word list to regexp
    ) %>% 
    stringr::str_squish() # remove excess whitespace
}

d <- d %>% 
  dplyr::mutate(
    last_name_cln = last_name_cln %>% # remove special words
      remove_words(c("DR", "II", "III", "IIII", "IV", "JR", "MD", "SR")),
    
    last_name_cln = dplyr::if_else( # remove very short names
      stringr::str_length(last_name_cln) > 1,
      last_name_cln,
      ""
    )
  )
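
The word removal relies on word boundaries, so only whole words are dropped. The sketch below is a base R stand-in for remove_words() (using gsub() with perl = TRUE instead of stringr) followed by the short-name rule. The names "CULBRETH JR" and "VAN DYKE" appear in the examples in this notebook. The other inputs, and the function name remove_words_sketch, are made up for illustration.

```r
# Base R stand-in for the stringr-based remove_words() above
remove_words_sketch <- function(x, w) {
  pattern <- paste0("\\b", w, "\\b", collapse = "|")  # word list -> regexp
  x <- gsub(pattern, "", x, perl = TRUE)              # delete whole-word matches
  trimws(gsub(" +", " ", x))                          # tidy leftover whitespace
}

suffixes <- c("DR", "II", "III", "IIII", "IV", "JR", "MD", "SR")

last <- c("CULBRETH JR", "VAN DYKE", "SMITH III", "MDOUGAL", "X", "MD")

last_cln <- remove_words_sketch(last, suffixes)
last_cln <- ifelse(nchar(last_cln) > 1, last_cln, "")  # blank names of 0 or 1 characters

print(data.frame(last, last_cln))
```

"MDOUGAL" is left intact because "MD" is not a separate word there, whereas a last name consisting only of "MD" or a single letter becomes the empty string and will later be flagged as missing.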

3.0.5 Middle name

  • map words AKA, DR, II, III, IV, JR, MD, MISS, MR, MRS, MS, NMN, NN, REV, SR to empty string
d <- d %>% 
  dplyr::mutate(
    midl_name_cln = midl_name_cln %>% # remove special words
      remove_words(c("AKA", "DR", "II", "III", "IV", "JR", "MD", "MISS", 
                     "MR", "MRS", "MS", "NMN", "NN", "REV", "SR"))
  )

3.0.6 First name

  • map words DR, FATHER, III, IV, JR, MD, MISS, MR, MRS, NMN, REV, SISTER, SR to empty string
  • if number of letters in first name = 0, move first word of middle name to first name
# if no first name, move first word of middle name to first name
move_name <- function(d) { # d: data frame of entity data
  has_first_name <- d$first_name_cln != ""
  
  re_fword <- "^[A-Z]+\\b" # regular expression for first word
  
  midl <- d$midl_name_cln
  
  midl_head <- midl %>% # get first word
    stringr::str_extract(re_fword) %>% 
    tidyr::replace_na("")
  
  midl_tail <- midl %>% # get remainder of words
    stringr::str_remove(re_fword) %>% 
    stringr::str_squish()
  
  d %>% 
    dplyr::mutate(
      first_name_cln = dplyr::if_else(has_first_name,
                                      first_name_cln,
                                      midl_head
      ),
      midl_name_cln = dplyr::if_else(has_first_name,
                                     midl_name_cln,
                                     midl_tail
      )
    )
}

d <- d %>% 
  dplyr::mutate(
    first_name_cln = first_name_cln %>% # remove special words
      remove_words(c("DR", "FATHER", "III", "IV", "JR", "MD", "MISS",
                     "MR", "MRS", "NMN", "REV", "SISTER", "SR"))
  ) %>% 
  move_name()
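
The effect of moving the first word of the middle name can be sketched on three records. This is a base R illustration of the same logic as move_name(), not a replacement for it. The first two rows follow the "SISTER" examples shown later in this notebook (KELLY and TANCRAITOR), taken at the point after title words have been removed. The third row is made up.

```r
# First and middle names after title words have been removed
first <- c("", "MAXINE", "")
midl  <- c("ANN", "ELIZABETH", "MARY JANE")

has_first <- first != ""

# First word of the middle name ("" if the middle name is empty)
midl_head <- sub("^([A-Z]*).*$", "\\1", midl, perl = TRUE)

# Remainder of the middle name after the first word is taken off
midl_tail <- trimws(sub("^[A-Z]+\\b", "", midl, perl = TRUE))

# Only records without a first name are changed
first_new <- ifelse(has_first, first, midl_head)
midl_new  <- ifelse(has_first, midl, midl_tail)

print(data.frame(first, midl, first_new, midl_new))
```

Records that already have a first name are untouched, so the move only ever fills a gap and never overwrites an existing first name.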

3.0.7 Postprocess all name variables

  • map all spaces to empty strings (Remove variability of spacing.)
  • add missing value indicator variables for all name variables
d <- d %>%
  dplyr::mutate(
    # remove all spaces
    last_name_cln  = last_name_cln  %>% stringr::str_remove_all(" "),
    first_name_cln = first_name_cln %>% stringr::str_remove_all(" "),
    midl_name_cln  = midl_name_cln  %>% stringr::str_remove_all(" "),
    
    # add missing value indicators
    last_name_cln_miss  = last_name_cln  == "",
    first_name_cln_miss = first_name_cln == "",
    midl_name_cln_miss  = midl_name_cln  == ""
  )

4 Examples

Show some examples of the cleaned data.

Quick distributions

d %>% 
  dplyr::select(ends_with("cln"), ends_with("_miss")) %>% 
  skimr::skim()
Table 4.1: Data summary
Name Piped data
Number of rows 4099699
Number of columns 8
_______________________
Column type frequency:
character 3
logical 4
numeric 1
________________________
Group variables None

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| last_name_cln | 0 | 1 | 0 | 20 | 18 | 189999 | 0 |
| first_name_cln | 0 | 1 | 0 | 18 | 15 | 124073 | 0 |
| midl_name_cln | 0 | 1 | 0 | 18 | 254774 | 172071 | 0 |

Variable type: logical

| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| age_cln_miss | 0 | 1 | 0.01 | FAL: 4068644, TRU: 31055 |
| last_name_cln_miss | 0 | 1 | 0.00 | FAL: 4099681, TRU: 18 |
| first_name_cln_miss | 0 | 1 | 0.00 | FAL: 4099684, TRU: 15 |
| midl_name_cln_miss | 0 | 1 | 0.06 | FAL: 3844925, TRU: 254774 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_cln | 0 | 1 | 46.3 | 17.71 | 0 | 33 | 45 | 58 | 104 | ▁▇▇▃▁ |

4.1 Age

d %>% 
  dplyr::group_by(age_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(age, age_cln, age_cln_miss) %>% 
  knitr::kable()
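| age | age_cln | age_cln_miss |
|---|---|---|
| 46 | 46 | FALSE |
| 50 | 50 | FALSE |
| 28 | 28 | FALSE |
| 46 | 46 | FALSE |
| 24 | 24 | FALSE |
| 56 | 56 | FALSE |
| 70 | 70 | FALSE |
| 37 | 37 | FALSE |
| 45 | 45 | FALSE |
| 45 | 45 | FALSE |
| 125 | 0 | TRUE |
| 0 | 0 | TRUE |
| 204 | 0 | TRUE |
| 105 | 0 | TRUE |
| 105 | 0 | TRUE |
| 0 | 0 | TRUE |
| 204 | 0 | TRUE |
| 0 | 0 | TRUE |
| 0 | 0 | TRUE |
| 204 | 0 | TRUE |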

4.2 Last name

d %>% 
  dplyr::group_by(last_name_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()
| last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
|---|---|---|---|---|---|---|---|---|
| ROBERTS | ROBERTS | FALSE | MICHAEL | MICHAEL | FALSE | THOMAS | THOMAS | FALSE |
| CASTELLOW | CASTELLOW | FALSE | FRED | FRED | FALSE | GOODWIN | GOODWIN | FALSE |
| HAZELTINE | HAZELTINE | FALSE | BONNIE | BONNIE | FALSE | MARIE | MARIE | FALSE |
| PITMAN | PITMAN | FALSE | CARRIE | CARRIE | FALSE |  |  | TRUE |
| PARKER | PARKER | FALSE | LATASHA | LATASHA | FALSE | D | D | FALSE |
| BYRON | BYRON | FALSE | CALLIE | CALLIE | FALSE | KAY | KAY | FALSE |
| BASS | BASS | FALSE | JEFFREY | JEFFREY | FALSE | DEANE | DEANE | FALSE |
| HURST | HURST | FALSE | REBECCA | REBECCA | FALSE | LYNN | LYNN | FALSE |
| STALEY | STALEY | FALSE | WESSIE | WESSIE | FALSE | LEE | LEE | FALSE |
| PETERS | PETERS | FALSE | ALAN | ALAN | FALSE | STUART | STUART | FALSE |
| Y |  | TRUE | PRUM | PRUM | FALSE |  |  | TRUE |
| M |  | TRUE | COY | COY | FALSE | FAY | FAY | FALSE |
| K |  | TRUE | RICHARD | RICHARD | FALSE | V | V | FALSE |
| R |  | TRUE | MARY | MARY | FALSE |  |  | TRUE |
| X |  | TRUE | WILLIE | WILLIE | FALSE | LARRY | LARRY | FALSE |
| X |  | TRUE | MARCUS | MARCUS | FALSE |  |  | TRUE |
| S |  | TRUE | PETER | PETER | FALSE | THOMAS | THOMAS | FALSE |
| R |  | TRUE | ANDREW | ANDREW | FALSE | PERNELL | PERNELL | FALSE |
| K |  | TRUE | HOA | HOA | FALSE | HIEP | HIEP | FALSE |
| H |  | TRUE | MOIH | MOIH | FALSE |  |  | TRUE |
d %>% 
  dplyr::filter(stringr::str_detect(last_name, "[- ']")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()
| last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
|---|---|---|---|---|---|---|---|---|
| LA LONDE | LALONDE | FALSE | MARY | MARY | FALSE | ANN KENNEDY | ANNKENNEDY | FALSE |
| FISHER-BORNE | FISHERBORNE | FALSE | CHANTELLE | CHANTELLE | FALSE | MARY | MARY | FALSE |
| OLLIS-SILVERS | OLLISSILVERS | FALSE | HOLLY | HOLLY | FALSE | MELISSA | MELISSA | FALSE |
| TEMPLE-HAGE | TEMPLEHAGE | FALSE | BARBARA | BARBARA | FALSE | ANN | ANN | FALSE |
| PABELLON-DELANDO | PABELLONDELANDO | FALSE | EDGAR | EDGAR | FALSE | ORLANDO | ORLANDO | FALSE |
| VAN DYKE | VANDYKE | FALSE | JAMES | JAMES | FALSE | CHARLES | CHARLES | FALSE |
| SMITH-CARTER | SMITHCARTER | FALSE | AMY | AMY | FALSE |  |  | TRUE |
| BOYD-CURRY | BOYDCURRY | FALSE | MALISSA | MALISSA | FALSE | ANN | ANN | FALSE |
| HARRIS-ALLEN | HARRISALLEN | FALSE | ERVETTE | ERVETTE | FALSE |  |  | TRUE |
| ST ARNOLD | STARNOLD | FALSE | SONYA | SONYA | FALSE | MICHELLE | MICHELLE | FALSE |
| DUARTE-CRUZ | DUARTECRUZ | FALSE | ROSALINO | ROSALINO | FALSE |  |  | TRUE |
| VON RUPP | VONRUPP | FALSE | MICHAEL | MICHAEL | FALSE | WILLIAM | WILLIAM | FALSE |
| THOMPSON-MILES | THOMPSONMILES | FALSE | GINA | GINA | FALSE |  |  | TRUE |
| VANDER BORGH | VANDERBORGH | FALSE | MARK | MARK | FALSE | A | A | FALSE |
| CASTANO-SCHULTZ | CASTANOSCHULTZ | FALSE | SUSANA | SUSANA | FALSE | JULIA | JULIA | FALSE |
| BARNES-PACE | BARNESPACE | FALSE | SHARON | SHARON | FALSE | LEE | LEE | FALSE |
| CULBRETH JR | CULBRETH | FALSE | WALTER | WALTER | FALSE | E | E | FALSE |
| JONES-MBEMBA | JONESMBEMBA | FALSE | LARHONDA | LARHONDA | FALSE | MICHELLE | MICHELLE | FALSE |
| ESTEBAN-VILLARREAL | ESTEBANVILLARREAL | FALSE | DEBBIE | DEBBIE | FALSE | E | E | FALSE |
| GARFIELD-JEFFERSON | GARFIELDJEFFERSON | FALSE | JAMES | JAMES | FALSE |  |  | TRUE |

4.3 First name

d %>% 
  dplyr::group_by(first_name_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()
| last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
|---|---|---|---|---|---|---|---|---|
| JENKS | JENKS | FALSE | ROBIN | ROBIN | FALSE | O | O | FALSE |
| GLEN | GLEN | FALSE | KENNETH | KENNETH | FALSE | W | W | FALSE |
| HOLT | HOLT | FALSE | TERESA | TERESA | FALSE | EAST | EAST | FALSE |
| BURRIS | BURRIS | FALSE | RHONDA | RHONDA | FALSE | KNIGHT | KNIGHT | FALSE |
| HAMILTON | HAMILTON | FALSE | LINDA | LINDA | FALSE | W | W | FALSE |
| JONES | JONES | FALSE | LEROY | LEROY | FALSE | DARELL | DARELL | FALSE |
| HANEY | HANEY | FALSE | SARAH | SARAH | FALSE | TAYLOR | TAYLOR | FALSE |
| VARNUM | VARNUM | FALSE | APRIL | APRIL | FALSE | ROSE | ROSE | FALSE |
| BULLARD | BULLARD | FALSE | CHRISTOPHER | CHRISTOPHER | FALSE | LEE | LEE | FALSE |
| JACKSON | JACKSON | FALSE | LYDIA | LYDIA | FALSE | ANN | ANN | FALSE |
| LUU | LUU | FALSE | MRS |  | TRUE |  |  | TRUE |
| FATE | FATE | FALSE | MR |  | TRUE |  |  | TRUE |
| MALIK | MALIK | FALSE |  |  | TRUE |  |  | TRUE |
| BURGESS | BURGESS | FALSE |  |  | TRUE |  |  | TRUE |
| PHOENIX | PHOENIX | FALSE |  |  | TRUE |  |  | TRUE |
| MAGENTA | MAGENTA | FALSE |  |  | TRUE |  |  | TRUE |
| TOOLE | TOOLE | FALSE | JR |  | TRUE |  |  | TRUE |
| AMEN | AMEN | FALSE |  |  | TRUE |  |  | TRUE |
| GRAYWOLF | GRAYWOLF | FALSE |  |  | TRUE |  |  | TRUE |
| ELSASS | ELSASS | FALSE |  |  | TRUE |  |  | TRUE |
d %>% 
  dplyr::filter(stringr::str_detect(first_name, "[- ']")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()
| last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
|---|---|---|---|---|---|---|---|---|
| ZHENG | ZHENG | FALSE | HONG YU | HONGYU | FALSE |  |  | TRUE |
| CALDWELL | CALDWELL | FALSE | REVELLIUS SUE | REVELLIUSSUE | FALSE |  |  | TRUE |
| HOEY | HOEY | FALSE | AN-TWAN | ANTWAN | FALSE | FERECE | FERECE | FALSE |
| JOHNSON | JOHNSON | FALSE | MARY ALICE | MARYALICE | FALSE | C | C | FALSE |
| BAILEY | BAILEY | FALSE | BETTY JEAN | BETTYJEAN | FALSE | BUNN | BUNN | FALSE |
| BEST | BEST | FALSE | LILLIAN M | LILLIANM | FALSE | D | D | FALSE |
| SOMMERS | SOMMERS | FALSE | DONNA-LEE | DONNALEE | FALSE | ANTOINETTE | ANTOINETTE | FALSE |
| IWANIK | IWANIK | FALSE | MARY HELEN | MARYHELEN | FALSE | MAROCCHI | MAROCCHI | FALSE |
| ALLEN | ALLEN | FALSE | BILLIE JO | BILLIEJO | FALSE | BROWN | BROWN | FALSE |
| BURNETTE | BURNETTE | FALSE | MARY ELLEN | MARYELLEN | FALSE |  |  | TRUE |
| WOOD | WOOD | FALSE | D KATHLEEN | DKATHLEEN | FALSE |  |  | TRUE |
| KIMBERLIN | KIMBERLIN | FALSE | JO ANN | JOANN | FALSE |  |  | TRUE |
| THORNTON | THORNTON | FALSE | MAE BELLE | MAEBELLE | FALSE | MABE | MABE | FALSE |
| HELO | HELO | FALSE | BASSAM HELES | BASSAMHELES | FALSE | FAHEED | FAHEED | FALSE |
| HARTSO | HARTSO | FALSE | ANITA FAYE | ANITAFAYE | FALSE | STARNES | STARNES | FALSE |
| LOGAN | LOGAN | FALSE | JO ANN | JOANN | FALSE | PEACOCK | PEACOCK | FALSE |
| JOHNSON | JOHNSON | FALSE | D L | DL | FALSE | HAYES | HAYES | FALSE |
| MORRISON | MORRISON | FALSE | WILLIE-P | WILLIEP | FALSE | MCNEILL | MCNEILL | FALSE |
| VAUGHN | VAUGHN | FALSE | M CAROLINE L | MCAROLINEL | FALSE |  |  | TRUE |
| GOSS | GOSS | FALSE | SHARI LYNN | SHARILYNN | FALSE | HOLMES | HOLMES | FALSE |
d %>% 
  dplyr::filter(stringr::str_detect(first_name, "SISTER")) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()
| last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
|---|---|---|---|---|---|---|---|---|
| ROSS | ROSS | FALSE | SISTER | S | FALSE | S |  | TRUE |
| KELLY | KELLY | FALSE | SISTER | ANN | FALSE | ANN |  | TRUE |
| PEGUESE | PEGUESE | FALSE | SISTER | GIRTRUE | FALSE | GIRTRUE |  | TRUE |
| TANCRAITOR | TANCRAITOR | FALSE | SISTER MAXINE | MAXINE | FALSE | ELIZABETH | ELIZABETH | FALSE |
| GILDEA | GILDEA | FALSE | SISTER | THERESINE | FALSE | THERESINE |  | TRUE |

4.4 Middle name

d %>% 
  dplyr::group_by(midl_name_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()
| last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
|---|---|---|---|---|---|---|---|---|
| JESTER | JESTER | FALSE | DAVID | DAVID | FALSE | RYAN | RYAN | FALSE |
| MANN | MANN | FALSE | TERESSA | TERESSA | FALSE | V | V | FALSE |
| WILLIAMSON | WILLIAMSON | FALSE | ALVIS | ALVIS | FALSE | MAURICE | MAURICE | FALSE |
| BOSWELL | BOSWELL | FALSE | MATTHEW | MATTHEW | FALSE | WAYNE | WAYNE | FALSE |
| RAY | RAY | FALSE | ERVIN | ERVIN | FALSE | N | N | FALSE |
| MILLER | MILLER | FALSE | BILLY | BILLY | FALSE | DAVID | DAVID | FALSE |
| KING | KING | FALSE | TICHICA | TICHICA | FALSE | MICHELLE | MICHELLE | FALSE |
| LLOYD | LLOYD | FALSE | GEOFFREY | GEOFFREY | FALSE | ALLEN | ALLEN | FALSE |
| MICHAELS | MICHAELS | FALSE | KATHLEEN | KATHLEEN | FALSE | NIELSEN | NIELSEN | FALSE |
| NEWTON | NEWTON | FALSE | LAURA | LAURA | FALSE | TINSLEY | TINSLEY | FALSE |
| MUNSON | MUNSON | FALSE | MARCELINA | MARCELINA | FALSE |  |  | TRUE |
| SOKOL | SOKOL | FALSE | JACK | JACK | FALSE |  |  | TRUE |
| MERCADO | MERCADO | FALSE | BISMARK | BISMARK | FALSE |  |  | TRUE |
| BASEMORE | BASEMORE | FALSE | ANTHONY | ANTHONY | FALSE |  |  | TRUE |
| GROOMS | GROOMS | FALSE | RICHARD | RICHARD | FALSE |  |  | TRUE |
| ZHOU | ZHOU | FALSE | WEN | WEN | FALSE |  |  | TRUE |
| WALDEN | WALDEN | FALSE | MICHAEL | MICHAEL | FALSE |  |  | TRUE |
| LARKIN | LARKIN | FALSE | STEPHANIE | STEPHANIE | FALSE |  |  | TRUE |
| GARRETT | GARRETT | FALSE | OPAL | OPAL | FALSE |  |  | TRUE |
| LOFTIN | LOFTIN | FALSE | HILDA | HILDA | FALSE |  |  | TRUE |
d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "[- ']")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()
| last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
|---|---|---|---|---|---|---|---|---|
| REED | REED | FALSE | MARY | MARY | FALSE | L HERMAN | LHERMAN | FALSE |
| CAMPBELL | CAMPBELL | FALSE | MATTIE | MATTIE | FALSE | L B | LB | FALSE |
| GADDY | GADDY | FALSE | CECILIA | CECILIA | FALSE | KAY LIPSCOMB | KAYLIPSCOMB | FALSE |
| LUCKEY | LUCKEY | FALSE | NANCY | NANCY | FALSE | EFIRD RITCHIE | EFIRDRITCHIE | FALSE |
| HERNANDEZ | HERNANDEZ | FALSE | FABIOLA | FABIOLA | FALSE | DE GORGONIA | DEGORGONIA | FALSE |
| KRATZENBERG | KRATZENBERG | FALSE | ELIZABETH | ELIZABETH | FALSE | ANNE PETTY | ANNEPETTY | FALSE |
| LOOKADOO | LOOKADOO | FALSE | VIRGINIA | VIRGINIA | FALSE | LOUISE EVERETT | LOUISEEVERETT | FALSE |
| WILLIAMS | WILLIAMS | FALSE | ANN | ANN | FALSE | SARA POWELL | SARAPOWELL | FALSE |
| WELLS | WELLS | FALSE | DEBORAH | DEBORAH | FALSE | KAY FRENCH | KAYFRENCH | FALSE |
| GRADY | GRADY | FALSE | SARAH | SARAH | FALSE | LOUISE HOWIE | LOUISEHOWIE | FALSE |
| JOHNSON | JOHNSON | FALSE | REVONDA | REVONDA | FALSE | KAY HUFFMAN | KAYHUFFMAN | FALSE |
| BOUHAROUN | BOUHAROUN | FALSE | WADE | WADE | FALSE | FARRAR GARLAND | FARRARGARLAND | FALSE |
| WILSON | WILSON | FALSE | NANCY | NANCY | FALSE | ELIZABETH LUZZI | ELIZABETHLUZZI | FALSE |
| BACHMAN | BACHMAN | FALSE | AGNES | AGNES | FALSE | JANE BLAKEN | JANEBLAKEN | FALSE |
| DAVIS | DAVIS | FALSE | MARTHA | MARTHA | FALSE | DEBREAUX EVANS | DEBREAUXEVANS | FALSE |
| SEWELL | SEWELL | FALSE | DEBRA | DEBRA | FALSE | LYNN HAMPTON | LYNNHAMPTON | FALSE |
| SMITH | SMITH | FALSE | KAY | KAY | FALSE | VIVA ROSS | VIVAROSS | FALSE |
| HENSON | HENSON | FALSE | HELEN | HELEN | FALSE | IRENE TIMM | IRENETIMM | FALSE |
| LONDON | LONDON | FALSE | JESSICA | JESSICA | FALSE | LEIGH NICOLE | LEIGHNICOLE | FALSE |
| LACKEY | LACKEY | FALSE | ERIN | ERIN | FALSE | LEE KEEFE | LEEKEEFE | FALSE |

5 Save data

# Show the clean data file location
# This is set in code/file_paths.R
fs::path_file(f_entity_cln_fst)
[1] "ent_cln.fst"
# save the usable entity data (cheap-skate caching)
d %>% fst::write_fst(f_entity_cln_fst, compress = 100)

Timing

Computation time (excl. render): 126.774 sec elapsed

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] hexbin_1.28.2   glue_1.4.2      knitr_1.30      skimr_2.1.2    
 [5] fst_0.9.4       fs_1.5.0        forcats_0.5.0   stringr_1.4.0  
 [9] dplyr_1.0.2     purrr_0.3.4     readr_1.4.0     tidyr_1.1.2    
[13] tibble_3.0.4    ggplot2_3.3.3   tidyverse_1.3.0 tictoc_1.0     
[17] here_1.0.1      workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        lattice_0.20-41   lubridate_1.7.9.2 assertthat_0.2.1 
 [5] rprojroot_2.0.2   digest_0.6.27     repr_1.1.0        R6_2.5.0         
 [9] cellranger_1.1.0  backports_1.2.1   reprex_0.3.0      evaluate_0.14    
[13] highr_0.8         httr_1.4.2        pillar_1.4.7      rlang_0.4.10     
[17] readxl_1.3.1      rstudioapi_0.13   whisker_0.4       rmarkdown_2.6    
[21] munsell_0.5.0     broom_0.7.3       compiler_4.0.3    httpuv_1.5.4     
[25] modelr_0.1.8      xfun_0.20         base64enc_0.1-3   pkgconfig_2.0.3  
[29] htmltools_0.5.0   tidyselect_1.1.0  bookdown_0.21     fansi_0.4.1      
[33] crayon_1.3.4      dbplyr_2.0.0      withr_2.3.0       later_1.1.0.1    
[37] grid_4.0.3        jsonlite_1.7.2    gtable_0.3.0      lifecycle_0.2.0  
[41] DBI_1.1.0         git2r_0.28.0      magrittr_2.0.1    scales_1.1.1     
[45] cli_2.2.0         stringi_1.5.3     renv_0.12.5       promises_1.1.1   
[49] xml2_1.3.2        ellipsis_0.3.1    generics_0.1.0    vctrs_0.3.6      
[53] tools_4.0.3       hms_0.5.3         parallel_4.0.3    yaml_2.2.1       
[57] colorspace_2.0-0  rvest_0.3.6       haven_2.3.1