Last updated: 2021-01-15

Checks: 7 passed, 0 failed

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20201104)

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7021bf2

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7021bf2. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01-6_clean_vars.Rmd) and HTML (docs/01-6_clean_vars.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	c674a51	Ross Gayler	2021-01-15	Add 01-6 clean vars

# Set up the project environment, because each Rmd file knits in a new R session
# so doesn't get the project setup from .Rprofile

# Project setup
library(here)
source(here::here("code", "setup_project.R"))

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.0.4     ✓ dplyr   1.0.2
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

# Extra set up for the 01*.Rmd notebooks
source(here::here("code", "setup_01.R"))


Attaching package: 'glue'

The following object is masked from 'package:dplyr':

    collapse

# Extra set up for this notebook
# ???

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

1 Introduction

The 01*.Rmd notebooks read the data, filter it to the subset to be used for modelling, characterise it to understand it, check for possible gotchas, clean it, and save it for the analyses proper.

This notebook (01-6_clean_vars) prepares all the variables for use in predictive modelling and saves the data.

1.1 Variable roles

The name variables, last_name, first_name, and midl_name, will definitely be used in compatibility modelling.

We intend to use the one snapshot file as both the database to be queried and as the set of queries. Consequently, strictly speaking, we don’t need to standardise the name variables because the database and query records are guaranteed to be identical (they will literally be the same record). However, we will look at the name variables with an eye to standardisation because it is never a good idea to statistically model data without having an idea about the quality of the data. We will apply some basic standardisation to the name variables, if appropriate, because it parallels what would be necessary in practice.

The demographic variables sex, age, birth_place, and the administrative variable county_id may be used as predictors and/or blocking variables.

The remainder of the variables (residence and administrative variables) will be kept in case they are useful for manually assessing claimed matches.

1.2 Cleanup for predictors

1.2.1 Name standardisation

Standardisation will be applied to the name variables last_name, first_name, and midl_name. This attempts to remove variation that is probably irrelevant to identity (e.g. case, punctuation, and spacing).

1.2.2 Missing values

In the previous notebooks I have converted empty strings to missing values (NA_character_ in R). This was convenient because table() and skim() count missing values as a separate category. However, modelling is a different kettle of fish.

In modelling, we want to get an estimated probability of identity match for every query, regardless of how many attributes have missing values. Typical modelling functions do not tolerate any missing (NA) values in predictors. If any of the predictors is missing then the estimate is also missing.

We avoid that problem by transforming each missing value into some nonmissing value and creating an extra variable to indicate the missingness. This will be done for the name variables last_name, first_name, and midl_name, and the demographic variable age. (This is not necessary for birth_place because “missing” is just another valid level of the variable.)
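The following is a minimal sketch of the problem and the workaround, using toy data in base R. The helper `add_miss_indicator()` and all values are hypothetical illustrations and are not part of the project code; the recoding actually applied to the data is the one in the "Apply cleanup" section.

```r
# Toy data only; nothing here comes from the voter snapshot.
train <- data.frame(x = c(1, 2, 3, 4, 5, 6), y = c(0, 0, 1, 0, 1, 1))
fit <- glm(y ~ x, family = binomial, data = train)

# A missing predictor value yields a missing predicted probability
p <- predict(fit, newdata = data.frame(x = c(2.5, NA)), type = "response")
print(is.na(p))

# Hypothetical helper: swap NA for a non-missing placeholder and keep a flag
add_miss_indicator <- function(x, placeholder) {
  data.frame(
    value = ifelse(is.na(x), placeholder, x), # non-missing replacement
    miss  = is.na(x),                         # TRUE where the original was NA
    stringsAsFactors = FALSE
  )
}

res <- add_miss_indicator(c("SMITH", NA, "JONES"), placeholder = "")
print(res)
```

With the indicator kept as a separate variable, a model can still distinguish "value was missing" from "value happened to equal the placeholder".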

1.2.3 Cleanup summary

The cleanup actions to be applied are (in order):

- age
  - convert from string to integer
  - add missing value indicator and set to true if age < 17 or age > 104
  - if age missing indicator is true, set age to 0
- preprocess all character variables
  - map missing to empty string
  - map lower case letters to upper case
- all names
  - map each non-alphanumeric character to a space (Remove variability of punctuation while preserving word boundaries.)
  - map words 11, 111, 1111 to words II, III, IIII (Correct substitution of 1 for I in generation suffixes.)
  - if name contains zero and no other digits, map zero to O (Correct substitution of 0 for O in names.)
  - map each digit to an empty string (Remove random digit insertions.)
- last name
  - map words DR, II, III, IIII, IV, JR, MD, SR to empty string
  - if number of letters in last name = 1, map name to empty string
- middle name
  - map words AKA, DR, II, III, IV, JR, MD, MISS, MR, MRS, MS, NMN, NN, REV, SR to empty string
- first name
  - map words DR, FATHER, III, IV, JR, MD, MISS, MR, MRS, NMN, REV, SISTER, SR to empty string
  - if number of letters in first name = 0, move first word of middle name to first name
- postprocess all name variables
  - map all spaces to empty strings (Remove variability of spacing.)
  - add missing value indicator variables for all name variables

2 Read data

Read the usable data. Remember that this consists of only the ACTIVE & VERIFIED records.

# Show the entity data file location
# This is set in code/file_paths.R
fs::path_file(f_entity_raw_fst)

[1] "ent_raw.fst"

# get entity data
d <- fst::read_fst(f_entity_raw_fst) %>% 
  tibble::as_tibble() %>% 
  dplyr::select(-county_desc, -voter_reg_num, -sex_code) # drop redundant vars

dim(d)

[1] 4099699      23

3 Apply cleanup

3.0.1 Age

- convert from string to integer
- add missing indicator and set to true if age < 17 or age > 104
- if age missing indicator is true, set age to 0

d <- d %>% 
  dplyr::mutate(
    age_cln = as.integer(age),
    age_cln_miss = ! dplyr::between(age_cln, 17, 104), # valid age range
    age_cln = dplyr::if_else(age_cln_miss, 0L, age_cln)
  )
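To make the age rule concrete, here is a small sketch that applies the same three steps to a toy data frame. The input values are invented for illustration and are not sampled from `d`.

```r
library(dplyr)

# Toy ages stored as strings; the values are illustrative, not sampled from d
toy <- data.frame(age = c("46", "0", "105", "17", "104", NA),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(
    age_cln      = as.integer(age),
    age_cln_miss = !between(age_cln, 17, 104), # TRUE outside the valid range
    age_cln      = if_else(age_cln_miss, 0L, age_cln)
  )

print(toy)
```

The boundary ages 17 and 104 are kept as valid. An `NA` input stays `NA` in both new columns, because `between()` returns `NA` for `NA` input. The skim summary in the Examples section reports a complete rate of 1 for `age_cln` and `age_cln_miss`, so this case does not appear to arise in this data set. If it could, something like `is.na(age_cln) | !between(age_cln, 17, 104)` would be one way to flag it.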

3.0.2 Preprocess all character variables

- map missing to empty string
- map lower case letters to upper case

Note that this is an in-place transformation of the character variables rather than adding the transformed values as new variables. This is because I am viewing this as light tidying rather than as creating distinctly new cleaned values.

tidy_char_var <- function(x) {
  x %>% 
    tidyr::replace_na("") %>% # map NA to ""
    stringr::str_to_upper() # map lower case to upper case
}

d <- d %>% 
  dplyr::mutate(
    across(where(is.character), tidy_char_var) # apply to all char vars
  )
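As a quick check of what this tidying does, the sketch below repeats the definition of `tidy_char_var()` so that it runs on its own and applies it to a few invented values.

```r
library(stringr)
library(tidyr)

# Same definition as above, repeated so the example runs on its own
tidy_char_var <- function(x) {
  x %>%
    tidyr::replace_na("") %>%   # map NA to ""
    stringr::str_to_upper()     # map lower case to upper case
}

toy <- c("Smith", NA, "o'neil") # illustrative values
out <- tidy_char_var(toy)
print(out)
```

Punctuation is left untouched at this stage; it is handled later by the name-specific cleaning.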

3.0.3 All names

- map each non-alphanumeric character to a space (Remove variability of punctuation. This preserves word boundaries.)
- map words 11, 111, 1111 to words II, III, IIII (Correct substitution of 1 for I in generation suffixes.)
- if name contains zero and no other digits, map zero to O (Correct substitution of 0 for O in names.)
- map each digit to an empty string (Remove random digit insertions.)

# map zero to O if there are no other digits in the string
map_0_to_O <- function(x) { # x: vector of strings
  dplyr::if_else(
    stringr::str_detect( x, "0") & # if string contains zero AND
      stringr::str_detect( x, "[1-9]", negate = TRUE), # string contains no other digits
    stringr::str_replace_all( x, "0", "O"), # then map zero to O
    x # else return x
  )
}

# apply all-name cleaning
clean_name_var <- function(x) { # x: vector of strings
  x %>% 
    stringr::str_replace_all("[^ A-Z0-9]", " ") %>% # map non-alphanumeric to " "
    stringr::str_replace_all( # fix generation suffixes
      c("\\b11\\b"   = "II", 
        "\\b111\\b"  = "III", 
        "\\b1111\\b" = "IIII")
    ) %>% 
    map_0_to_O() %>% 
    stringr::str_remove_all("[0-9]") %>% # map remaining digits to ""
    stringr::str_squish() # remove excess whitespace
}

d <- d %>%
  dplyr::mutate(
    across(
      .cols = c(last_name, first_name, midl_name), # apply to all name vars
      .fns = clean_name_var, 
      .names = "{.col}_cln")
  )
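The sketch below traces the all-name cleaning on a handful of invented names. The two functions are repeated from above so the example is self-contained. Note that the pattern `[^ A-Z0-9]` assumes the input has already been upper-cased by the preprocessing step, otherwise lower-case letters would be mapped to spaces.

```r
library(dplyr)
library(stringr)

# Definitions repeated from above so the example runs on its own
map_0_to_O <- function(x) {
  if_else(
    str_detect(x, "0") & str_detect(x, "[1-9]", negate = TRUE),
    str_replace_all(x, "0", "O"), # zero is the only digit: treat it as letter O
    x                             # otherwise leave the string unchanged
  )
}

clean_name_var <- function(x) {
  x %>%
    str_replace_all("[^ A-Z0-9]", " ") %>%          # punctuation to spaces
    str_replace_all(c("\\b11\\b"   = "II",          # fix generation suffixes
                      "\\b111\\b"  = "III",
                      "\\b1111\\b" = "IIII")) %>%
    map_0_to_O() %>%
    str_remove_all("[0-9]") %>%                     # drop remaining digits
    str_squish()                                    # tidy whitespace
}

# Illustrative values, already upper case
toy <- c("O'NEIL-SMITH", "JONES 111", "J0NES", "SM1TH", "R0SS 2")
out <- clean_name_var(toy)
print(data.frame(toy, out))
```

In the last toy value the zero is not mapped to O because another digit is present, so both digits are simply removed. The word-boundary anchors keep the `11` pattern from matching inside `111`.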

3.0.4 Last name

- map words DR, II, III, IIII, IV, JR, MD, SR to empty string
- if number of letters in last name = 1, map name to empty string

# remove words (w) from vector of strings (x)
remove_words <- function(x, w) { # x, w: vectors of char (w = words to remove)
  x %>% 
    stringr::str_remove_all(
      pattern =  paste0("\\b", w, "\\b", collapse = "|") #convert word list to regexp
    ) %>% 
    stringr::str_squish() # remove excess whitespace
}

d <- d %>% 
  dplyr::mutate(
    last_name_cln = last_name_cln %>% # remove special words
      remove_words(c("DR", "II", "III", "IIII", "IV", "JR", "MD", "SR")),
    
    last_name_cln = dplyr::if_else( # remove very short names
      stringr::str_length(last_name_cln) > 1,
      last_name_cln,
      ""
    )
  )
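To show how the word list becomes a regular expression, the sketch below repeats `remove_words()` and runs it with a shortened word list on invented names, followed by the short-name rule. Base `ifelse()` is used here in place of `dplyr::if_else()` only to keep the example's dependencies small.

```r
library(stringr)

# Same definition as above, repeated so the example runs on its own
remove_words <- function(x, w) {
  x %>%
    str_remove_all(pattern = paste0("\\b", w, "\\b", collapse = "|")) %>%
    str_squish()
}

w <- c("DR", "JR", "III", "MD")                      # shortened word list
pattern <- paste0("\\b", w, "\\b", collapse = "|")   # one alternative per word
print(pattern)

toy <- c("CULBRETH JR", "DREW", "SMITH III MD", "X") # illustrative values
out <- remove_words(toy, w)
out <- ifelse(str_length(out) > 1, out, "")          # drop one-letter names
print(out)
```

Because every alternative is wrapped in `\\b` word boundaries, `DR` is removed only as a whole word, so a name such as `DREW` is left intact.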

3.0.5 Middle name

- map words AKA, DR, II, III, IV, JR, MD, MISS, MR, MRS, MS, NMN, NN, REV, SR to empty string

d <- d %>% 
  dplyr::mutate(
    midl_name_cln = midl_name_cln %>% # remove special words
      remove_words(c("AKA", "DR", "II", "III", "IV", "JR", "MD", "MISS", 
                     "MR", "MRS", "MS", "NMN", "NN", "REV", "SR"))
  )

3.0.6 First name

- map words DR, FATHER, III, IV, JR, MD, MISS, MR, MRS, NMN, REV, SISTER, SR to empty string
- if number of letters in first name = 0, move first word of middle name to first name

# if no first name, move first word of middle name to first name
move_name <- function(d) { # d: data frame of entity data
  has_first_name <- d$first_name_cln != ""
  
  re_fword <- "^[A-Z]+\\b" # regular expression for first word
  
  midl <- d$midl_name_cln
  
  midl_head <- midl %>% # get first word
    stringr::str_extract(re_fword) %>% 
    tidyr::replace_na("")
  
  midl_tail <- midl %>% # get remainder of words
    stringr::str_remove(re_fword) %>% 
    stringr::str_squish()
  
  d %>% 
    dplyr::mutate(
      first_name_cln = dplyr::if_else(has_first_name,
                                      first_name_cln,
                                      midl_head
      ),
      midl_name_cln = dplyr::if_else(has_first_name,
                                     midl_name_cln,
                                     midl_tail
      )
    )
}

d <- d %>% 
  dplyr::mutate(
    first_name_cln = first_name_cln %>% # remove special words
      remove_words(c("DR", "FATHER", "III", "IV", "JR", "MD", "MISS",
                     "MR", "MRS", "NMN", "REV", "SISTER", "SR"))
  ) %>% 
  move_name()
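The effect of `move_name()` is easiest to see on a tiny example. The sketch below is a condensed restatement of the function above (same logic, fewer lines) applied to an invented three-row data frame.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Condensed restatement of move_name() above, so the example runs on its own
move_name <- function(d) {
  has_first_name <- d$first_name_cln != ""
  re_fword <- "^[A-Z]+\\b"                       # first word of the string
  midl_head <- d$midl_name_cln %>%
    str_extract(re_fword) %>%
    tidyr::replace_na("")                        # no first word: use ""
  midl_tail <- d$midl_name_cln %>%
    str_remove(re_fword) %>%
    str_squish()                                 # remaining words, if any
  d %>%
    mutate(
      first_name_cln = if_else(has_first_name, first_name_cln, midl_head),
      midl_name_cln  = if_else(has_first_name, midl_name_cln, midl_tail)
    )
}

# Illustrative values: empty first name with a two-word middle name,
# a normal record, and a record with both names empty
toy <- data.frame(
  first_name_cln = c("", "MAXINE", ""),
  midl_name_cln  = c("ANN MARIE", "ELIZABETH", ""),
  stringsAsFactors = FALSE
)
out <- move_name(toy)
print(out)
```

Only rows with an empty first name are changed. When the middle name is also empty, both variables stay empty and are later flagged as missing.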

3.0.7 Postprocess all name variables

- map all spaces to empty strings (Remove variability of spacing.)
- add missing value indicator variables for all name variables

d <- d %>%
  dplyr::mutate(
    # remove all spaces
    last_name_cln  = last_name_cln  %>% stringr::str_remove_all(" "),
    first_name_cln = first_name_cln %>% stringr::str_remove_all(" "),
    midl_name_cln  = midl_name_cln  %>% stringr::str_remove_all(" "),
    
    # add missing value indicators
    last_name_cln_miss  = last_name_cln  == "",
    first_name_cln_miss = first_name_cln == "",
    midl_name_cln_miss  = midl_name_cln  == ""
  )
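A short sketch of the postprocessing on invented values shows why spaces are removed: spelling variants that differ only in spacing collapse to the same string, and an empty result is what the indicator flags as missing.

```r
library(stringr)

toy      <- c("VAN DYKE", "VANDYKE", "")  # illustrative values
toy_cln  <- str_remove_all(toy, " ")      # remove every space
toy_miss <- toy_cln == ""                 # TRUE for empty (missing) names
print(data.frame(toy, toy_cln, toy_miss))
```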

4 Examples

Show some examples of the cleaned data.

Quick distributions

d %>% 
  dplyr::select(ends_with("cln"), ends_with("_miss")) %>% 
  skimr::skim()

Table 4.1: Data summary
Name	Piped data
Number of rows	4099699
Number of columns	8
_______________________
Column type frequency:
character	3
logical	4
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	max	empty	n_unique
last_name_cln	1	20	18	189999
first_name_cln	1	18	15	124073
midl_name_cln	1	18	254774	172071

Variable type: logical

skim_variable	complete_rate	mean	count
age_cln_miss	1	0.01	FAL: 4068644, TRU: 31055
last_name_cln_miss	1	0.00	FAL: 4099681, TRU: 18
first_name_cln_miss	1	0.00	FAL: 4099684, TRU: 15
midl_name_cln_miss	1	0.06	FAL: 3844925, TRU: 254774

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age_cln	0	1	46.3	17.71	0	33	45	58	104	▁▇▇▃▁

4.1 Age

d %>% 
  dplyr::group_by(age_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(age, age_cln, age_cln_miss) %>% 
  knitr::kable()

age	age_cln	age_cln_miss
46	46	FALSE
50	50	FALSE
28	28	FALSE
46	46	FALSE
24	24	FALSE
56	56	FALSE
70	70	FALSE
37	37	FALSE
45	45	FALSE
45	45	FALSE
125	0	TRUE
0	0	TRUE
204	0	TRUE
105	0	TRUE
105	0	TRUE
0	0	TRUE
204	0	TRUE
0	0	TRUE
0	0	TRUE
204	0	TRUE

4.2 Last name

d %>% 
  dplyr::group_by(last_name_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()

last_name	last_name_cln	last_name_cln_miss	first_name	first_name_cln	first_name_cln_miss	midl_name	midl_name_cln	midl_name_cln_miss
ROBERTS	ROBERTS	FALSE	MICHAEL	MICHAEL	FALSE	THOMAS	THOMAS	FALSE
CASTELLOW	CASTELLOW	FALSE	FRED	FRED	FALSE	GOODWIN	GOODWIN	FALSE
HAZELTINE	HAZELTINE	FALSE	BONNIE	BONNIE	FALSE	MARIE	MARIE	FALSE
PITMAN	PITMAN	FALSE	CARRIE	CARRIE	FALSE			TRUE
PARKER	PARKER	FALSE	LATASHA	LATASHA	FALSE	D	D	FALSE
BYRON	BYRON	FALSE	CALLIE	CALLIE	FALSE	KAY	KAY	FALSE
BASS	BASS	FALSE	JEFFREY	JEFFREY	FALSE	DEANE	DEANE	FALSE
HURST	HURST	FALSE	REBECCA	REBECCA	FALSE	LYNN	LYNN	FALSE
STALEY	STALEY	FALSE	WESSIE	WESSIE	FALSE	LEE	LEE	FALSE
PETERS	PETERS	FALSE	ALAN	ALAN	FALSE	STUART	STUART	FALSE
Y		TRUE	PRUM	PRUM	FALSE			TRUE
M		TRUE	COY	COY	FALSE	FAY	FAY	FALSE
K		TRUE	RICHARD	RICHARD	FALSE	V	V	FALSE
R		TRUE	MARY	MARY	FALSE			TRUE
X		TRUE	WILLIE	WILLIE	FALSE	LARRY	LARRY	FALSE
X		TRUE	MARCUS	MARCUS	FALSE			TRUE
S		TRUE	PETER	PETER	FALSE	THOMAS	THOMAS	FALSE
R		TRUE	ANDREW	ANDREW	FALSE	PERNELL	PERNELL	FALSE
K		TRUE	HOA	HOA	FALSE	HIEP	HIEP	FALSE
H		TRUE	MOIH	MOIH	FALSE			TRUE

d %>% 
  dplyr::filter(stringr::str_detect(last_name, "[- ']")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()

last_name	last_name_cln	last_name_cln_miss	first_name	first_name_cln	first_name_cln_miss	midl_name	midl_name_cln	midl_name_cln_miss
LA LONDE	LALONDE	FALSE	MARY	MARY	FALSE	ANN KENNEDY	ANNKENNEDY	FALSE
FISHER-BORNE	FISHERBORNE	FALSE	CHANTELLE	CHANTELLE	FALSE	MARY	MARY	FALSE
OLLIS-SILVERS	OLLISSILVERS	FALSE	HOLLY	HOLLY	FALSE	MELISSA	MELISSA	FALSE
TEMPLE-HAGE	TEMPLEHAGE	FALSE	BARBARA	BARBARA	FALSE	ANN	ANN	FALSE
PABELLON-DELANDO	PABELLONDELANDO	FALSE	EDGAR	EDGAR	FALSE	ORLANDO	ORLANDO	FALSE
VAN DYKE	VANDYKE	FALSE	JAMES	JAMES	FALSE	CHARLES	CHARLES	FALSE
SMITH-CARTER	SMITHCARTER	FALSE	AMY	AMY	FALSE			TRUE
BOYD-CURRY	BOYDCURRY	FALSE	MALISSA	MALISSA	FALSE	ANN	ANN	FALSE
HARRIS-ALLEN	HARRISALLEN	FALSE	ERVETTE	ERVETTE	FALSE			TRUE
ST ARNOLD	STARNOLD	FALSE	SONYA	SONYA	FALSE	MICHELLE	MICHELLE	FALSE
DUARTE-CRUZ	DUARTECRUZ	FALSE	ROSALINO	ROSALINO	FALSE			TRUE
VON RUPP	VONRUPP	FALSE	MICHAEL	MICHAEL	FALSE	WILLIAM	WILLIAM	FALSE
THOMPSON-MILES	THOMPSONMILES	FALSE	GINA	GINA	FALSE			TRUE
VANDER BORGH	VANDERBORGH	FALSE	MARK	MARK	FALSE	A	A	FALSE
CASTANO-SCHULTZ	CASTANOSCHULTZ	FALSE	SUSANA	SUSANA	FALSE	JULIA	JULIA	FALSE
BARNES-PACE	BARNESPACE	FALSE	SHARON	SHARON	FALSE	LEE	LEE	FALSE
CULBRETH JR	CULBRETH	FALSE	WALTER	WALTER	FALSE	E	E	FALSE
JONES-MBEMBA	JONESMBEMBA	FALSE	LARHONDA	LARHONDA	FALSE	MICHELLE	MICHELLE	FALSE
ESTEBAN-VILLARREAL	ESTEBANVILLARREAL	FALSE	DEBBIE	DEBBIE	FALSE	E	E	FALSE
GARFIELD-JEFFERSON	GARFIELDJEFFERSON	FALSE	JAMES	JAMES	FALSE			TRUE

4.3 First name

d %>% 
  dplyr::group_by(first_name_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()

last_name	last_name_cln	last_name_cln_miss	first_name	first_name_cln	first_name_cln_miss	midl_name	midl_name_cln	midl_name_cln_miss
JENKS	JENKS	FALSE	ROBIN	ROBIN	FALSE	O	O	FALSE
GLEN	GLEN	FALSE	KENNETH	KENNETH	FALSE	W	W	FALSE
HOLT	HOLT	FALSE	TERESA	TERESA	FALSE	EAST	EAST	FALSE
BURRIS	BURRIS	FALSE	RHONDA	RHONDA	FALSE	KNIGHT	KNIGHT	FALSE
HAMILTON	HAMILTON	FALSE	LINDA	LINDA	FALSE	W	W	FALSE
JONES	JONES	FALSE	LEROY	LEROY	FALSE	DARELL	DARELL	FALSE
HANEY	HANEY	FALSE	SARAH	SARAH	FALSE	TAYLOR	TAYLOR	FALSE
VARNUM	VARNUM	FALSE	APRIL	APRIL	FALSE	ROSE	ROSE	FALSE
BULLARD	BULLARD	FALSE	CHRISTOPHER	CHRISTOPHER	FALSE	LEE	LEE	FALSE
JACKSON	JACKSON	FALSE	LYDIA	LYDIA	FALSE	ANN	ANN	FALSE
LUU	LUU	FALSE	MRS		TRUE			TRUE
FATE	FATE	FALSE	MR		TRUE			TRUE
MALIK	MALIK	FALSE			TRUE			TRUE
BURGESS	BURGESS	FALSE			TRUE			TRUE
PHOENIX	PHOENIX	FALSE			TRUE			TRUE
MAGENTA	MAGENTA	FALSE			TRUE			TRUE
TOOLE	TOOLE	FALSE	JR		TRUE			TRUE
AMEN	AMEN	FALSE			TRUE			TRUE
GRAYWOLF	GRAYWOLF	FALSE			TRUE			TRUE
ELSASS	ELSASS	FALSE			TRUE			TRUE

d %>% 
  dplyr::filter(stringr::str_detect(first_name, "[- ']")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()

last_name	last_name_cln	last_name_cln_miss	first_name	first_name_cln	first_name_cln_miss	midl_name	midl_name_cln	midl_name_cln_miss
ZHENG	ZHENG	FALSE	HONG YU	HONGYU	FALSE			TRUE
CALDWELL	CALDWELL	FALSE	REVELLIUS SUE	REVELLIUSSUE	FALSE			TRUE
HOEY	HOEY	FALSE	AN-TWAN	ANTWAN	FALSE	FERECE	FERECE	FALSE
JOHNSON	JOHNSON	FALSE	MARY ALICE	MARYALICE	FALSE	C	C	FALSE
BAILEY	BAILEY	FALSE	BETTY JEAN	BETTYJEAN	FALSE	BUNN	BUNN	FALSE
BEST	BEST	FALSE	LILLIAN M	LILLIANM	FALSE	D	D	FALSE
SOMMERS	SOMMERS	FALSE	DONNA-LEE	DONNALEE	FALSE	ANTOINETTE	ANTOINETTE	FALSE
IWANIK	IWANIK	FALSE	MARY HELEN	MARYHELEN	FALSE	MAROCCHI	MAROCCHI	FALSE
ALLEN	ALLEN	FALSE	BILLIE JO	BILLIEJO	FALSE	BROWN	BROWN	FALSE
BURNETTE	BURNETTE	FALSE	MARY ELLEN	MARYELLEN	FALSE			TRUE
WOOD	WOOD	FALSE	D KATHLEEN	DKATHLEEN	FALSE			TRUE
KIMBERLIN	KIMBERLIN	FALSE	JO ANN	JOANN	FALSE			TRUE
THORNTON	THORNTON	FALSE	MAE BELLE	MAEBELLE	FALSE	MABE	MABE	FALSE
HELO	HELO	FALSE	BASSAM HELES	BASSAMHELES	FALSE	FAHEED	FAHEED	FALSE
HARTSO	HARTSO	FALSE	ANITA FAYE	ANITAFAYE	FALSE	STARNES	STARNES	FALSE
LOGAN	LOGAN	FALSE	JO ANN	JOANN	FALSE	PEACOCK	PEACOCK	FALSE
JOHNSON	JOHNSON	FALSE	D L	DL	FALSE	HAYES	HAYES	FALSE
MORRISON	MORRISON	FALSE	WILLIE-P	WILLIEP	FALSE	MCNEILL	MCNEILL	FALSE
VAUGHN	VAUGHN	FALSE	M CAROLINE L	MCAROLINEL	FALSE			TRUE
GOSS	GOSS	FALSE	SHARI LYNN	SHARILYNN	FALSE	HOLMES	HOLMES	FALSE

d %>% 
  dplyr::filter(stringr::str_detect(first_name, "SISTER")) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()

last_name	last_name_cln	last_name_cln_miss	first_name	first_name_cln	first_name_cln_miss	midl_name	midl_name_cln	midl_name_cln_miss
ROSS	ROSS	FALSE	SISTER	S	FALSE	S		TRUE
KELLY	KELLY	FALSE	SISTER	ANN	FALSE	ANN		TRUE
PEGUESE	PEGUESE	FALSE	SISTER	GIRTRUE	FALSE	GIRTRUE		TRUE
TANCRAITOR	TANCRAITOR	FALSE	SISTER MAXINE	MAXINE	FALSE	ELIZABETH	ELIZABETH	FALSE
GILDEA	GILDEA	FALSE	SISTER	THERESINE	FALSE	THERESINE		TRUE

4.4 Middle name

d %>% 
  dplyr::group_by(midl_name_cln_miss) %>% 
  dplyr::slice_sample(n = 10) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()

last_name	last_name_cln	last_name_cln_miss	first_name	first_name_cln	first_name_cln_miss	midl_name	midl_name_cln	midl_name_cln_miss
JESTER	JESTER	FALSE	DAVID	DAVID	FALSE	RYAN	RYAN	FALSE
MANN	MANN	FALSE	TERESSA	TERESSA	FALSE	V	V	FALSE
WILLIAMSON	WILLIAMSON	FALSE	ALVIS	ALVIS	FALSE	MAURICE	MAURICE	FALSE
BOSWELL	BOSWELL	FALSE	MATTHEW	MATTHEW	FALSE	WAYNE	WAYNE	FALSE
RAY	RAY	FALSE	ERVIN	ERVIN	FALSE	N	N	FALSE
MILLER	MILLER	FALSE	BILLY	BILLY	FALSE	DAVID	DAVID	FALSE
KING	KING	FALSE	TICHICA	TICHICA	FALSE	MICHELLE	MICHELLE	FALSE
LLOYD	LLOYD	FALSE	GEOFFREY	GEOFFREY	FALSE	ALLEN	ALLEN	FALSE
MICHAELS	MICHAELS	FALSE	KATHLEEN	KATHLEEN	FALSE	NIELSEN	NIELSEN	FALSE
NEWTON	NEWTON	FALSE	LAURA	LAURA	FALSE	TINSLEY	TINSLEY	FALSE
MUNSON	MUNSON	FALSE	MARCELINA	MARCELINA	FALSE			TRUE
SOKOL	SOKOL	FALSE	JACK	JACK	FALSE			TRUE
MERCADO	MERCADO	FALSE	BISMARK	BISMARK	FALSE			TRUE
BASEMORE	BASEMORE	FALSE	ANTHONY	ANTHONY	FALSE			TRUE
GROOMS	GROOMS	FALSE	RICHARD	RICHARD	FALSE			TRUE
ZHOU	ZHOU	FALSE	WEN	WEN	FALSE			TRUE
WALDEN	WALDEN	FALSE	MICHAEL	MICHAEL	FALSE			TRUE
LARKIN	LARKIN	FALSE	STEPHANIE	STEPHANIE	FALSE			TRUE
GARRETT	GARRETT	FALSE	OPAL	OPAL	FALSE			TRUE
LOFTIN	LOFTIN	FALSE	HILDA	HILDA	FALSE			TRUE

d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "[- ']")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>% 
  knitr::kable()

last_name	last_name_cln	last_name_cln_miss	first_name	first_name_cln	first_name_cln_miss	midl_name	midl_name_cln	midl_name_cln_miss
REED	REED	FALSE	MARY	MARY	FALSE	L HERMAN	LHERMAN	FALSE
CAMPBELL	CAMPBELL	FALSE	MATTIE	MATTIE	FALSE	L B	LB	FALSE
GADDY	GADDY	FALSE	CECILIA	CECILIA	FALSE	KAY LIPSCOMB	KAYLIPSCOMB	FALSE
LUCKEY	LUCKEY	FALSE	NANCY	NANCY	FALSE	EFIRD RITCHIE	EFIRDRITCHIE	FALSE
HERNANDEZ	HERNANDEZ	FALSE	FABIOLA	FABIOLA	FALSE	DE GORGONIA	DEGORGONIA	FALSE
KRATZENBERG	KRATZENBERG	FALSE	ELIZABETH	ELIZABETH	FALSE	ANNE PETTY	ANNEPETTY	FALSE
LOOKADOO	LOOKADOO	FALSE	VIRGINIA	VIRGINIA	FALSE	LOUISE EVERETT	LOUISEEVERETT	FALSE
WILLIAMS	WILLIAMS	FALSE	ANN	ANN	FALSE	SARA POWELL	SARAPOWELL	FALSE
WELLS	WELLS	FALSE	DEBORAH	DEBORAH	FALSE	KAY FRENCH	KAYFRENCH	FALSE
GRADY	GRADY	FALSE	SARAH	SARAH	FALSE	LOUISE HOWIE	LOUISEHOWIE	FALSE
JOHNSON	JOHNSON	FALSE	REVONDA	REVONDA	FALSE	KAY HUFFMAN	KAYHUFFMAN	FALSE
BOUHAROUN	BOUHAROUN	FALSE	WADE	WADE	FALSE	FARRAR GARLAND	FARRARGARLAND	FALSE
WILSON	WILSON	FALSE	NANCY	NANCY	FALSE	ELIZABETH LUZZI	ELIZABETHLUZZI	FALSE
BACHMAN	BACHMAN	FALSE	AGNES	AGNES	FALSE	JANE BLAKEN	JANEBLAKEN	FALSE
DAVIS	DAVIS	FALSE	MARTHA	MARTHA	FALSE	DEBREAUX EVANS	DEBREAUXEVANS	FALSE
SEWELL	SEWELL	FALSE	DEBRA	DEBRA	FALSE	LYNN HAMPTON	LYNNHAMPTON	FALSE
SMITH	SMITH	FALSE	KAY	KAY	FALSE	VIVA ROSS	VIVAROSS	FALSE
HENSON	HENSON	FALSE	HELEN	HELEN	FALSE	IRENE TIMM	IRENETIMM	FALSE
LONDON	LONDON	FALSE	JESSICA	JESSICA	FALSE	LEIGH NICOLE	LEIGHNICOLE	FALSE
LACKEY	LACKEY	FALSE	ERIN	ERIN	FALSE	LEE KEEFE	LEEKEEFE	FALSE

5 Save data

# Show the clean data file location
# This is set in code/file_paths.R
fs::path_file(f_entity_cln_fst)

[1] "ent_cln.fst"

# save the usable entity data (cheap-skate caching)
d %>% fst::write_fst(f_entity_cln_fst, compress = 100)

Timing

Computation time (excl. render): 116.793 sec elapsed

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] hexbin_1.28.2   glue_1.4.2      knitr_1.30      skimr_2.1.2    
 [5] fst_0.9.4       fs_1.5.0        forcats_0.5.0   stringr_1.4.0  
 [9] dplyr_1.0.2     purrr_0.3.4     readr_1.4.0     tidyr_1.1.2    
[13] tibble_3.0.4    ggplot2_3.3.3   tidyverse_1.3.0 tictoc_1.0     
[17] here_1.0.1      workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        lattice_0.20-41   lubridate_1.7.9.2 assertthat_0.2.1 
 [5] rprojroot_2.0.2   digest_0.6.27     repr_1.1.0        R6_2.5.0         
 [9] cellranger_1.1.0  backports_1.2.1   reprex_0.3.0      evaluate_0.14    
[13] highr_0.8         httr_1.4.2        pillar_1.4.7      rlang_0.4.10     
[17] readxl_1.3.1      rstudioapi_0.13   whisker_0.4       rmarkdown_2.6    
[21] munsell_0.5.0     broom_0.7.3       compiler_4.0.3    httpuv_1.5.4     
[25] modelr_0.1.8      xfun_0.20         base64enc_0.1-3   pkgconfig_2.0.3  
[29] htmltools_0.5.0   tidyselect_1.1.0  bookdown_0.21     fansi_0.4.1      
[33] crayon_1.3.4      dbplyr_2.0.0      withr_2.3.0       later_1.1.0.1    
[37] grid_4.0.3        jsonlite_1.7.2    gtable_0.3.0      lifecycle_0.2.0  
[41] DBI_1.1.0         git2r_0.28.0      magrittr_2.0.1    scales_1.1.1     
[45] cli_2.2.0         stringi_1.5.3     renv_0.12.5       promises_1.1.1   
[49] xml2_1.3.2        ellipsis_0.3.1    generics_0.1.0    vctrs_0.3.6      
[53] tools_4.0.3       hms_0.5.3         parallel_4.0.3    yaml_2.2.1       
[57] colorspace_2.0-0  rvest_0.3.6       haven_2.3.1

01-6_clean_vars

Clean all the variables

Ross Gayler

2021-01-13