Last updated: 2021-01-15
Checks: 7 0
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 7021bf2. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: data/VR_20051125.txt.xz
Ignored: output/ent_cln.fst
Ignored: output/ent_raw.fst
Ignored: renv/library/
Ignored: renv/staging/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01-6_clean_vars.Rmd) and HTML (docs/01-6_clean_vars.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | c674a51 | Ross Gayler | 2021-01-15 | Add 01-6 clean vars |
# Set up the project environment, because each Rmd file knits in a new R session
# so doesn't get the project setup from .Rprofile
# Project setup
library(here)
source(here::here("code", "setup_project.R"))
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.0.4 ✓ dplyr 1.0.2
✓ tidyr 1.1.2 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
# Extra set up for the 01*.Rmd notebooks
source(here::here("code", "setup_01.R"))
Attaching package: 'glue'
The following object is masked from 'package:dplyr':
collapse
# Extra set up for this notebook
# ???
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
The 01*.Rmd notebooks read the data, filter it to the subset to be used for modelling, characterise it to understand it, check for possible gotchas, clean it, and save it for the analyses proper.
This notebook (01-6_clean_vars) prepares all the variables for use in predictive modelling and saves the data.
The name variables, last_name, first_name, and midl_name, will definitely be used in compatibility modelling.
We intend to use the one snapshot file as both the database to be queried and as the set of queries. Consequently, strictly speaking, we don’t need to standardise the name variables because the database and query records are guaranteed to be identical (they will literally be the same record). However, we will look at the name variables with an eye to standardisation because it is never a good idea to statistically model data without having an idea about the quality of the data. We will apply some basic standardisation to the name variables, if appropriate, because it parallels what would be necessary in practice.
The demographic variables sex, age, birt_place, and the administrative variable county_id may be used as predictors and/or blocking variables.
The remainder of the variables (residence and administrative variables) will be kept in case they are useful for manually assessing claimed matches.
Standardisation will be applied to the name variables last_name, first_name, and midl_name. This attempts to remove variation that is probably irrelevant to identity (e.g. case, punctuation, and spacing).
In the previous notebooks I have converted empty strings to missing values (NA_character_ in R). This was convenient because table() and skim() count missing values as a separate category. However, modelling is a different kettle of fish.
In modelling, we want to get an estimated probability of identity match
for every query, regardless of how many attributes have missing values.
Typical modelling functions do not tolerate any missing (NA) values in predictors. If any of the predictors is missing then the estimate is also missing.
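To make the NA propagation concrete, here is a minimal sketch using base R's glm() on made-up data. The toy data frame and the variable names x and y are illustrative assumptions and are not part of this project. A query with a missing predictor value receives a missing predicted probability.

```r
# Toy data (hypothetical): a binary outcome y and one numeric predictor x
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(0, 0, 1, 0, 1, 1)
)

# Fit a simple logistic regression
fit <- glm(y ~ x, data = toy, family = binomial())

# Two queries: one complete, one with a missing predictor
queries <- data.frame(x = c(2.5, NA))

# The prediction for the query with the missing predictor is itself missing
p <- predict(fit, newdata = queries, type = "response")
is.na(p)
```

The first query gets a usable probability and the second does not, which is the behaviour the missing value indicators are meant to avoid.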
We avoid that problem by transforming the missing values into some nonmissing value and creating an extra variable to indicate the missingness. This will be done for the name variables last_name, first_name, and midl_name, and the demographic variable age. (This is not necessary for birth_place because “missing” is just another valid level of the variable.)
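As a rough illustration of the indicator idea, the following base R sketch recodes a made-up name vector. The values are hypothetical, and the choice of the empty string as the nonmissing placeholder is an assumption made for the sketch.

```r
# Toy name vector (hypothetical values); NA and "" both mean "no name recorded"
first_name <- c("MARY", NA, "", "JO")

# Indicator variable: TRUE where the value is missing or empty
first_name_miss <- is.na(first_name) | first_name == ""

# Replace missing values with a nonmissing placeholder (the empty string)
first_name_cln <- ifelse(first_name_miss, "", first_name)

data.frame(first_name, first_name_cln, first_name_miss)
```

After the recoding the cleaned variable contains no NA values, so a model can always produce an estimate, while the indicator lets the model treat the placeholder differently from a real value.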
The cleanup actions to be applied are (in order):
age
preprocess all character variables
all names
last name
middle name
first name
postprocess all name variables
Read the usable data. Remember that this consists of only the ACTIVE & VERIFIED records.
# Show the entity data file location
# This is set in code/file_paths.R
fs::path_file(f_entity_raw_fst)
[1] "ent_raw.fst"
# get entity data
d <- fst::read_fst(f_entity_raw_fst) %>%
  tibble::as_tibble() %>%
  dplyr::select(-county_desc, -voter_reg_num, -sex_code) # drop redundant vars
dim(d)
[1] 4099699 23
d <- d %>%
  dplyr::mutate(
    age_cln = as.integer(age),
    age_cln_miss = ! dplyr::between(age_cln, 17, 104), # valid age range
    age_cln = dplyr::if_else(age_cln_miss, 0L, age_cln)
  )
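The age rule can be tried on a few made-up values. This is a base R variation, not the notebook's own code: it assumes age arrives as character and, unlike the dplyr::between() call above, treats an NA age explicitly as missing (between() would return NA for an NA input).

```r
# Toy ages (hypothetical), including out-of-range values and an NA
age <- c("46", "0", "204", "17", "104", NA)

age_cln <- as.integer(age)

# Valid ages lie in [17, 104]; NA is explicitly treated as not in range
in_range <- !is.na(age_cln) & age_cln >= 17L & age_cln <= 104L
age_cln_miss <- !in_range

# Invalid or missing ages are replaced by the placeholder value 0
age_cln <- ifelse(age_cln_miss, 0L, age_cln)

data.frame(age, age_cln, age_cln_miss)
```

The boundary values 17 and 104 are kept, while 0, 204, and NA are all mapped to 0 with the indicator set to TRUE.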
Note that this is an in-place transformation of the character variables rather than adding the transformed values as new variables. This is because I am viewing this as light tidying rather than as creating distinctly new cleaned values.
tidy_char_var <- function(x) {
  x %>%
    tidyr::replace_na("") %>% # map NA to ""
    stringr::str_to_upper()   # map lower case to upper case
}
d <- d %>%
  dplyr::mutate(
    across(where(is.character), tidy_char_var) # apply to all char vars
  )
# map zero to O if there are no other digits in the string
map_0_to_O <- function(x) { # x: vector of strings
  dplyr::if_else(
    stringr::str_detect(x, "0") &                     # if string contains zero AND
      stringr::str_detect(x, "[1-9]", negate = TRUE), # string contains no other digits
    stringr::str_replace_all(x, "0", "O"),            # then map zero to O
    x                                                 # else return x
  )
}
# apply all-name cleaning
clean_name_var <- function(x) { # x: vector of strings
  x %>%
    stringr::str_replace_all("[^ A-Z0-9]", " ") %>% # map non-alphanumeric to " "
    stringr::str_replace_all(                       # fix generation suffixes
      c("\\b11\\b" = "II",
        "\\b111\\b" = "III",
        "\\b1111\\b" = "IIII")
    ) %>%
    map_0_to_O() %>%
    stringr::str_remove_all("[0-9]") %>% # map remaining digits to ""
    stringr::str_squish()                # remove excess whitespace
}
d <- d %>%
  dplyr::mutate(
    across(
      .cols = c(last_name, first_name, midl_name), # apply to all name vars
      .fns = clean_name_var,
      .names = "{.col}_cln")
  )
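The effect of the all-name cleaning is easiest to see on a few made-up inputs. The sketch below restates the two helpers compactly (using base ifelse() so that only stringr is needed) and applies them to hypothetical names; the toy names are assumptions chosen to exercise each step.

```r
library(stringr)

# Compact restatement of the helpers above (base ifelse instead of dplyr::if_else)
map_0_to_O <- function(x) {
  only_zero_digits <- str_detect(x, "0") & !str_detect(x, "[1-9]")
  ifelse(only_zero_digits, str_replace_all(x, "0", "O"), x)
}

clean_name_var <- function(x) {
  x <- str_replace_all(x, "[^ A-Z0-9]", " ")  # non-alphanumeric to " "
  x <- str_replace_all(                       # digit-typed generation suffixes
    x,
    c("\\b11\\b" = "II", "\\b111\\b" = "III", "\\b1111\\b" = "IIII")
  )
  x <- map_0_to_O(x)                          # zero to O if no other digits
  x <- str_remove_all(x, "[0-9]")             # drop remaining digits
  str_squish(x)                               # tidy whitespace
}

# Hypothetical inputs, already upper case as in the notebook
toy_names <- c("O'NEIL-SMITH", "M00RE", "SMITH 111", "J0HNSON 3RD", "  DE  LA   CRUZ ")
cleaned <- clean_name_var(toy_names)
cleaned
```

Here M00RE becomes MOORE and SMITH 111 becomes SMITH III, whereas J0HNSON 3RD contains a digit other than zero, so its zero is not mapped to O and the name ends up as JHNSON RD. That last case shows a limitation of the zero rule rather than a desirable outcome.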
# remove words (w) from vector of strings (x)
remove_words <- function(x, w) { # x, w: vectors of char (w = words to remove)
  x %>%
    stringr::str_remove_all(
      pattern = paste0("\\b", w, "\\b", collapse = "|") # convert word list to regexp
    ) %>%
    stringr::str_squish() # remove excess whitespace
}
d <- d %>%
  dplyr::mutate(
    last_name_cln = last_name_cln %>% # remove special words
      remove_words(c("DR", "II", "III", "IIII", "IV", "JR", "MD", "SR")),
    last_name_cln = dplyr::if_else(   # remove very short names
      stringr::str_length(last_name_cln) > 1,
      last_name_cln,
      ""
    )
  )
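A small sketch of the word removal and the short-name rule on made-up last names follows. The helper is restated so the snippet runs on its own, and the toy names and the shortened word list are assumptions for illustration. The word boundaries in the pattern mean that only whole words are removed.

```r
library(stringr)

# Restatement of remove_words(): build one alternation of whole-word patterns
remove_words <- function(x, w) {
  pattern <- paste0("\\b", w, "\\b", collapse = "|")
  str_squish(str_remove_all(x, pattern))
}

# Hypothetical cleaned last names
toy_last <- c("CULBRETH JR", "SMITH III", "JRADY", "Y")

# Remove a few special words, then blank out names of length 1 or less
stripped <- remove_words(toy_last, c("JR", "II", "III"))
kept <- ifelse(str_length(stripped) > 1, stripped, "")

data.frame(toy_last, stripped, kept)
```

JR and III are removed as whole words, JRADY is left alone because JR is only a prefix there, and the single-letter name Y is replaced by the empty string.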
d <- d %>%
  dplyr::mutate(
    midl_name_cln = midl_name_cln %>% # remove special words
      remove_words(c("AKA", "DR", "II", "III", "IV", "JR", "MD", "MISS",
                     "MR", "MRS", "MS", "NMN", "NN", "REV", "SR"))
  )
# if no first name, move first word of middle name to first name
move_name <- function(d) { # d: data frame of entity data
  has_first_name <- d$first_name_cln != ""
  re_fword <- "^[A-Z]+\\b" # regular expression for first word
  midl <- d$midl_name_cln
  midl_head <- midl %>% # get first word
    stringr::str_extract(re_fword) %>%
    tidyr::replace_na("")
  midl_tail <- midl %>% # get remainder of words
    stringr::str_remove(re_fword) %>%
    stringr::str_squish()
  d %>%
    dplyr::mutate(
      first_name_cln = dplyr::if_else(has_first_name,
                                      first_name_cln,
                                      midl_head
      ),
      midl_name_cln = dplyr::if_else(has_first_name,
                                     midl_name_cln,
                                     midl_tail
      )
    )
}
d <- d %>%
  dplyr::mutate(
    first_name_cln = first_name_cln %>% # remove special words
      remove_words(c("DR", "FATHER", "III", "IV", "JR", "MD", "MISS",
                     "MR", "MRS", "NMN", "REV", "SISTER", "SR"))
  ) %>%
  move_name()
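The logic of move_name() can be traced on plain vectors. The sketch below is a vector-level restatement (using base ifelse() and only stringr), applied to three hypothetical records; it is meant to illustrate the rule, not to replace the function above.

```r
library(stringr)

# Hypothetical cleaned names for three records
first <- c("", "MARY", "")
midl  <- c("ANN KENNEDY", "LEE", "")

re_fword <- "^[A-Z]+\\b"  # first word of the middle name
has_first <- first != ""

# Head = first word of the middle name ("" if there is none)
midl_head <- str_extract(midl, re_fword)
midl_head[is.na(midl_head)] <- ""

# Tail = whatever remains after removing the first word
midl_tail <- str_squish(str_remove(midl, re_fword))

# Only records without a first name borrow from the middle name
first_new <- ifelse(has_first, first, midl_head)
midl_new  <- ifelse(has_first, midl, midl_tail)

data.frame(first, midl, first_new, midl_new)
```

The first record gains ANN as its first name and keeps KENNEDY as its middle name, the second record is unchanged, and the third stays empty in both fields because there is nothing to move.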
d <- d %>%
  dplyr::mutate(
    # remove all spaces
    last_name_cln = last_name_cln %>% stringr::str_remove_all(" "),
    first_name_cln = first_name_cln %>% stringr::str_remove_all(" "),
    midl_name_cln = midl_name_cln %>% stringr::str_remove_all(" "),
    # add missing value indicators
    last_name_cln_miss = last_name_cln == "",
    first_name_cln_miss = first_name_cln == "",
    midl_name_cln_miss = midl_name_cln == ""
  )
Show some examples of the cleaned data.
Quick distributions
d %>%
  dplyr::select(ends_with("cln"), ends_with("_miss")) %>%
  skimr::skim()
Name | Piped data |
Number of rows | 4099699 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
character | 3 |
logical | 4 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
last_name_cln | 0 | 1 | 0 | 20 | 18 | 189999 | 0 |
first_name_cln | 0 | 1 | 0 | 18 | 15 | 124073 | 0 |
midl_name_cln | 0 | 1 | 0 | 18 | 254774 | 172071 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
age_cln_miss | 0 | 1 | 0.01 | FAL: 4068644, TRU: 31055 |
last_name_cln_miss | 0 | 1 | 0.00 | FAL: 4099681, TRU: 18 |
first_name_cln_miss | 0 | 1 | 0.00 | FAL: 4099684, TRU: 15 |
midl_name_cln_miss | 0 | 1 | 0.06 | FAL: 3844925, TRU: 254774 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age_cln | 0 | 1 | 46.3 | 17.71 | 0 | 33 | 45 | 58 | 104 | ▁▇▇▃▁ |
d %>%
  dplyr::group_by(age_cln_miss) %>%
  dplyr::slice_sample(n = 10) %>%
  dplyr::select(age, age_cln, age_cln_miss) %>%
  knitr::kable()
age | age_cln | age_cln_miss |
---|---|---|
46 | 46 | FALSE |
50 | 50 | FALSE |
28 | 28 | FALSE |
46 | 46 | FALSE |
24 | 24 | FALSE |
56 | 56 | FALSE |
70 | 70 | FALSE |
37 | 37 | FALSE |
45 | 45 | FALSE |
45 | 45 | FALSE |
125 | 0 | TRUE |
0 | 0 | TRUE |
204 | 0 | TRUE |
105 | 0 | TRUE |
105 | 0 | TRUE |
0 | 0 | TRUE |
204 | 0 | TRUE |
0 | 0 | TRUE |
0 | 0 | TRUE |
204 | 0 | TRUE |
d %>%
  dplyr::group_by(last_name_cln_miss) %>%
  dplyr::slice_sample(n = 10) %>%
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>%
  knitr::kable()
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
ROBERTS | ROBERTS | FALSE | MICHAEL | MICHAEL | FALSE | THOMAS | THOMAS | FALSE |
CASTELLOW | CASTELLOW | FALSE | FRED | FRED | FALSE | GOODWIN | GOODWIN | FALSE |
HAZELTINE | HAZELTINE | FALSE | BONNIE | BONNIE | FALSE | MARIE | MARIE | FALSE |
PITMAN | PITMAN | FALSE | CARRIE | CARRIE | FALSE | | | TRUE |
PARKER | PARKER | FALSE | LATASHA | LATASHA | FALSE | D | D | FALSE |
BYRON | BYRON | FALSE | CALLIE | CALLIE | FALSE | KAY | KAY | FALSE |
BASS | BASS | FALSE | JEFFREY | JEFFREY | FALSE | DEANE | DEANE | FALSE |
HURST | HURST | FALSE | REBECCA | REBECCA | FALSE | LYNN | LYNN | FALSE |
STALEY | STALEY | FALSE | WESSIE | WESSIE | FALSE | LEE | LEE | FALSE |
PETERS | PETERS | FALSE | ALAN | ALAN | FALSE | STUART | STUART | FALSE |
Y | | TRUE | PRUM | PRUM | FALSE | | | TRUE |
M | | TRUE | COY | COY | FALSE | FAY | FAY | FALSE |
K | | TRUE | RICHARD | RICHARD | FALSE | V | V | FALSE |
R | | TRUE | MARY | MARY | FALSE | | | TRUE |
X | | TRUE | WILLIE | WILLIE | FALSE | LARRY | LARRY | FALSE |
X | | TRUE | MARCUS | MARCUS | FALSE | | | TRUE |
S | | TRUE | PETER | PETER | FALSE | THOMAS | THOMAS | FALSE |
R | | TRUE | ANDREW | ANDREW | FALSE | PERNELL | PERNELL | FALSE |
K | | TRUE | HOA | HOA | FALSE | HIEP | HIEP | FALSE |
H | | TRUE | MOIH | MOIH | FALSE | | | TRUE |
d %>%
  dplyr::filter(stringr::str_detect(last_name, "[- ']")) %>%
  dplyr::slice_sample(n = 20) %>%
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>%
  knitr::kable()
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
LA LONDE | LALONDE | FALSE | MARY | MARY | FALSE | ANN KENNEDY | ANNKENNEDY | FALSE |
FISHER-BORNE | FISHERBORNE | FALSE | CHANTELLE | CHANTELLE | FALSE | MARY | MARY | FALSE |
OLLIS-SILVERS | OLLISSILVERS | FALSE | HOLLY | HOLLY | FALSE | MELISSA | MELISSA | FALSE |
TEMPLE-HAGE | TEMPLEHAGE | FALSE | BARBARA | BARBARA | FALSE | ANN | ANN | FALSE |
PABELLON-DELANDO | PABELLONDELANDO | FALSE | EDGAR | EDGAR | FALSE | ORLANDO | ORLANDO | FALSE |
VAN DYKE | VANDYKE | FALSE | JAMES | JAMES | FALSE | CHARLES | CHARLES | FALSE |
SMITH-CARTER | SMITHCARTER | FALSE | AMY | AMY | FALSE | | | TRUE |
BOYD-CURRY | BOYDCURRY | FALSE | MALISSA | MALISSA | FALSE | ANN | ANN | FALSE |
HARRIS-ALLEN | HARRISALLEN | FALSE | ERVETTE | ERVETTE | FALSE | | | TRUE |
ST ARNOLD | STARNOLD | FALSE | SONYA | SONYA | FALSE | MICHELLE | MICHELLE | FALSE |
DUARTE-CRUZ | DUARTECRUZ | FALSE | ROSALINO | ROSALINO | FALSE | | | TRUE |
VON RUPP | VONRUPP | FALSE | MICHAEL | MICHAEL | FALSE | WILLIAM | WILLIAM | FALSE |
THOMPSON-MILES | THOMPSONMILES | FALSE | GINA | GINA | FALSE | | | TRUE |
VANDER BORGH | VANDERBORGH | FALSE | MARK | MARK | FALSE | A | A | FALSE |
CASTANO-SCHULTZ | CASTANOSCHULTZ | FALSE | SUSANA | SUSANA | FALSE | JULIA | JULIA | FALSE |
BARNES-PACE | BARNESPACE | FALSE | SHARON | SHARON | FALSE | LEE | LEE | FALSE |
CULBRETH JR | CULBRETH | FALSE | WALTER | WALTER | FALSE | E | E | FALSE |
JONES-MBEMBA | JONESMBEMBA | FALSE | LARHONDA | LARHONDA | FALSE | MICHELLE | MICHELLE | FALSE |
ESTEBAN-VILLARREAL | ESTEBANVILLARREAL | FALSE | DEBBIE | DEBBIE | FALSE | E | E | FALSE |
GARFIELD-JEFFERSON | GARFIELDJEFFERSON | FALSE | JAMES | JAMES | FALSE | | | TRUE |
d %>%
  dplyr::group_by(first_name_cln_miss) %>%
  dplyr::slice_sample(n = 10) %>%
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>%
  knitr::kable()
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
JENKS | JENKS | FALSE | ROBIN | ROBIN | FALSE | O | O | FALSE |
GLEN | GLEN | FALSE | KENNETH | KENNETH | FALSE | W | W | FALSE |
HOLT | HOLT | FALSE | TERESA | TERESA | FALSE | EAST | EAST | FALSE |
BURRIS | BURRIS | FALSE | RHONDA | RHONDA | FALSE | KNIGHT | KNIGHT | FALSE |
HAMILTON | HAMILTON | FALSE | LINDA | LINDA | FALSE | W | W | FALSE |
JONES | JONES | FALSE | LEROY | LEROY | FALSE | DARELL | DARELL | FALSE |
HANEY | HANEY | FALSE | SARAH | SARAH | FALSE | TAYLOR | TAYLOR | FALSE |
VARNUM | VARNUM | FALSE | APRIL | APRIL | FALSE | ROSE | ROSE | FALSE |
BULLARD | BULLARD | FALSE | CHRISTOPHER | CHRISTOPHER | FALSE | LEE | LEE | FALSE |
JACKSON | JACKSON | FALSE | LYDIA | LYDIA | FALSE | ANN | ANN | FALSE |
LUU | LUU | FALSE | MRS | | TRUE | | | TRUE |
FATE | FATE | FALSE | MR | | TRUE | | | TRUE |
MALIK | MALIK | FALSE | | | TRUE | | | TRUE |
BURGESS | BURGESS | FALSE | | | TRUE | | | TRUE |
PHOENIX | PHOENIX | FALSE | | | TRUE | | | TRUE |
MAGENTA | MAGENTA | FALSE | | | TRUE | | | TRUE |
TOOLE | TOOLE | FALSE | JR | | TRUE | | | TRUE |
AMEN | AMEN | FALSE | | | TRUE | | | TRUE |
GRAYWOLF | GRAYWOLF | FALSE | | | TRUE | | | TRUE |
ELSASS | ELSASS | FALSE | | | TRUE | | | TRUE |
d %>%
  dplyr::filter(stringr::str_detect(first_name, "[- ']")) %>%
  dplyr::slice_sample(n = 20) %>%
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>%
  knitr::kable()
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
ZHENG | ZHENG | FALSE | HONG YU | HONGYU | FALSE | | | TRUE |
CALDWELL | CALDWELL | FALSE | REVELLIUS SUE | REVELLIUSSUE | FALSE | | | TRUE |
HOEY | HOEY | FALSE | AN-TWAN | ANTWAN | FALSE | FERECE | FERECE | FALSE |
JOHNSON | JOHNSON | FALSE | MARY ALICE | MARYALICE | FALSE | C | C | FALSE |
BAILEY | BAILEY | FALSE | BETTY JEAN | BETTYJEAN | FALSE | BUNN | BUNN | FALSE |
BEST | BEST | FALSE | LILLIAN M | LILLIANM | FALSE | D | D | FALSE |
SOMMERS | SOMMERS | FALSE | DONNA-LEE | DONNALEE | FALSE | ANTOINETTE | ANTOINETTE | FALSE |
IWANIK | IWANIK | FALSE | MARY HELEN | MARYHELEN | FALSE | MAROCCHI | MAROCCHI | FALSE |
ALLEN | ALLEN | FALSE | BILLIE JO | BILLIEJO | FALSE | BROWN | BROWN | FALSE |
BURNETTE | BURNETTE | FALSE | MARY ELLEN | MARYELLEN | FALSE | | | TRUE |
WOOD | WOOD | FALSE | D KATHLEEN | DKATHLEEN | FALSE | | | TRUE |
KIMBERLIN | KIMBERLIN | FALSE | JO ANN | JOANN | FALSE | | | TRUE |
THORNTON | THORNTON | FALSE | MAE BELLE | MAEBELLE | FALSE | MABE | MABE | FALSE |
HELO | HELO | FALSE | BASSAM HELES | BASSAMHELES | FALSE | FAHEED | FAHEED | FALSE |
HARTSO | HARTSO | FALSE | ANITA FAYE | ANITAFAYE | FALSE | STARNES | STARNES | FALSE |
LOGAN | LOGAN | FALSE | JO ANN | JOANN | FALSE | PEACOCK | PEACOCK | FALSE |
JOHNSON | JOHNSON | FALSE | D L | DL | FALSE | HAYES | HAYES | FALSE |
MORRISON | MORRISON | FALSE | WILLIE-P | WILLIEP | FALSE | MCNEILL | MCNEILL | FALSE |
VAUGHN | VAUGHN | FALSE | M CAROLINE L | MCAROLINEL | FALSE | | | TRUE |
GOSS | GOSS | FALSE | SHARI LYNN | SHARILYNN | FALSE | HOLMES | HOLMES | FALSE |
d %>%
  dplyr::filter(stringr::str_detect(first_name, "SISTER")) %>%
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>%
  knitr::kable()
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
ROSS | ROSS | FALSE | SISTER | S | FALSE | S | | TRUE |
KELLY | KELLY | FALSE | SISTER | ANN | FALSE | ANN | | TRUE |
PEGUESE | PEGUESE | FALSE | SISTER | GIRTRUE | FALSE | GIRTRUE | | TRUE |
TANCRAITOR | TANCRAITOR | FALSE | SISTER MAXINE | MAXINE | FALSE | ELIZABETH | ELIZABETH | FALSE |
GILDEA | GILDEA | FALSE | SISTER | THERESINE | FALSE | THERESINE | | TRUE |
d %>%
  dplyr::group_by(midl_name_cln_miss) %>%
  dplyr::slice_sample(n = 10) %>%
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>%
  knitr::kable()
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
JESTER | JESTER | FALSE | DAVID | DAVID | FALSE | RYAN | RYAN | FALSE |
MANN | MANN | FALSE | TERESSA | TERESSA | FALSE | V | V | FALSE |
WILLIAMSON | WILLIAMSON | FALSE | ALVIS | ALVIS | FALSE | MAURICE | MAURICE | FALSE |
BOSWELL | BOSWELL | FALSE | MATTHEW | MATTHEW | FALSE | WAYNE | WAYNE | FALSE |
RAY | RAY | FALSE | ERVIN | ERVIN | FALSE | N | N | FALSE |
MILLER | MILLER | FALSE | BILLY | BILLY | FALSE | DAVID | DAVID | FALSE |
KING | KING | FALSE | TICHICA | TICHICA | FALSE | MICHELLE | MICHELLE | FALSE |
LLOYD | LLOYD | FALSE | GEOFFREY | GEOFFREY | FALSE | ALLEN | ALLEN | FALSE |
MICHAELS | MICHAELS | FALSE | KATHLEEN | KATHLEEN | FALSE | NIELSEN | NIELSEN | FALSE |
NEWTON | NEWTON | FALSE | LAURA | LAURA | FALSE | TINSLEY | TINSLEY | FALSE |
MUNSON | MUNSON | FALSE | MARCELINA | MARCELINA | FALSE | | | TRUE |
SOKOL | SOKOL | FALSE | JACK | JACK | FALSE | | | TRUE |
MERCADO | MERCADO | FALSE | BISMARK | BISMARK | FALSE | | | TRUE |
BASEMORE | BASEMORE | FALSE | ANTHONY | ANTHONY | FALSE | | | TRUE |
GROOMS | GROOMS | FALSE | RICHARD | RICHARD | FALSE | | | TRUE |
ZHOU | ZHOU | FALSE | WEN | WEN | FALSE | | | TRUE |
WALDEN | WALDEN | FALSE | MICHAEL | MICHAEL | FALSE | | | TRUE |
LARKIN | LARKIN | FALSE | STEPHANIE | STEPHANIE | FALSE | | | TRUE |
GARRETT | GARRETT | FALSE | OPAL | OPAL | FALSE | | | TRUE |
LOFTIN | LOFTIN | FALSE | HILDA | HILDA | FALSE | | | TRUE |
d %>%
  dplyr::filter(stringr::str_detect(midl_name, "[- ']")) %>%
  dplyr::slice_sample(n = 20) %>%
  dplyr::select(
    last_name, last_name_cln, last_name_cln_miss,
    first_name, first_name_cln, first_name_cln_miss,
    midl_name, midl_name_cln, midl_name_cln_miss
  ) %>%
  knitr::kable()
last_name | last_name_cln | last_name_cln_miss | first_name | first_name_cln | first_name_cln_miss | midl_name | midl_name_cln | midl_name_cln_miss |
---|---|---|---|---|---|---|---|---|
REED | REED | FALSE | MARY | MARY | FALSE | L HERMAN | LHERMAN | FALSE |
CAMPBELL | CAMPBELL | FALSE | MATTIE | MATTIE | FALSE | L B | LB | FALSE |
GADDY | GADDY | FALSE | CECILIA | CECILIA | FALSE | KAY LIPSCOMB | KAYLIPSCOMB | FALSE |
LUCKEY | LUCKEY | FALSE | NANCY | NANCY | FALSE | EFIRD RITCHIE | EFIRDRITCHIE | FALSE |
HERNANDEZ | HERNANDEZ | FALSE | FABIOLA | FABIOLA | FALSE | DE GORGONIA | DEGORGONIA | FALSE |
KRATZENBERG | KRATZENBERG | FALSE | ELIZABETH | ELIZABETH | FALSE | ANNE PETTY | ANNEPETTY | FALSE |
LOOKADOO | LOOKADOO | FALSE | VIRGINIA | VIRGINIA | FALSE | LOUISE EVERETT | LOUISEEVERETT | FALSE |
WILLIAMS | WILLIAMS | FALSE | ANN | ANN | FALSE | SARA POWELL | SARAPOWELL | FALSE |
WELLS | WELLS | FALSE | DEBORAH | DEBORAH | FALSE | KAY FRENCH | KAYFRENCH | FALSE |
GRADY | GRADY | FALSE | SARAH | SARAH | FALSE | LOUISE HOWIE | LOUISEHOWIE | FALSE |
JOHNSON | JOHNSON | FALSE | REVONDA | REVONDA | FALSE | KAY HUFFMAN | KAYHUFFMAN | FALSE |
BOUHAROUN | BOUHAROUN | FALSE | WADE | WADE | FALSE | FARRAR GARLAND | FARRARGARLAND | FALSE |
WILSON | WILSON | FALSE | NANCY | NANCY | FALSE | ELIZABETH LUZZI | ELIZABETHLUZZI | FALSE |
BACHMAN | BACHMAN | FALSE | AGNES | AGNES | FALSE | JANE BLAKEN | JANEBLAKEN | FALSE |
DAVIS | DAVIS | FALSE | MARTHA | MARTHA | FALSE | DEBREAUX EVANS | DEBREAUXEVANS | FALSE |
SEWELL | SEWELL | FALSE | DEBRA | DEBRA | FALSE | LYNN HAMPTON | LYNNHAMPTON | FALSE |
SMITH | SMITH | FALSE | KAY | KAY | FALSE | VIVA ROSS | VIVAROSS | FALSE |
HENSON | HENSON | FALSE | HELEN | HELEN | FALSE | IRENE TIMM | IRENETIMM | FALSE |
LONDON | LONDON | FALSE | JESSICA | JESSICA | FALSE | LEIGH NICOLE | LEIGHNICOLE | FALSE |
LACKEY | LACKEY | FALSE | ERIN | ERIN | FALSE | LEE KEEFE | LEEKEEFE | FALSE |
# Show the clean data file location
# This is set in code/file_paths.R
fs::path_file(f_entity_cln_fst)
[1] "ent_cln.fst"
# save the usable entity data (cheap-skate caching)
d %>% fst::write_fst(f_entity_cln_fst, compress = 100)
Computation time (excl. render): 116.793 sec elapsed
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] hexbin_1.28.2 glue_1.4.2 knitr_1.30 skimr_2.1.2
[5] fst_0.9.4 fs_1.5.0 forcats_0.5.0 stringr_1.4.0
[9] dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[13] tibble_3.0.4 ggplot2_3.3.3 tidyverse_1.3.0 tictoc_1.0
[17] here_1.0.1 workflowr_1.6.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 lubridate_1.7.9.2 assertthat_0.2.1
[5] rprojroot_2.0.2 digest_0.6.27 repr_1.1.0 R6_2.5.0
[9] cellranger_1.1.0 backports_1.2.1 reprex_0.3.0 evaluate_0.14
[13] highr_0.8 httr_1.4.2 pillar_1.4.7 rlang_0.4.10
[17] readxl_1.3.1 rstudioapi_0.13 whisker_0.4 rmarkdown_2.6
[21] munsell_0.5.0 broom_0.7.3 compiler_4.0.3 httpuv_1.5.4
[25] modelr_0.1.8 xfun_0.20 base64enc_0.1-3 pkgconfig_2.0.3
[29] htmltools_0.5.0 tidyselect_1.1.0 bookdown_0.21 fansi_0.4.1
[33] crayon_1.3.4 dbplyr_2.0.0 withr_2.3.0 later_1.1.0.1
[37] grid_4.0.3 jsonlite_1.7.2 gtable_0.3.0 lifecycle_0.2.0
[41] DBI_1.1.0 git2r_0.28.0 magrittr_2.0.1 scales_1.1.1
[45] cli_2.2.0 stringi_1.5.3 renv_0.12.5 promises_1.1.1
[49] xml2_1.3.2 ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.6
[53] tools_4.0.3 hms_0.5.3 parallel_4.0.3 yaml_2.2.1
[57] colorspace_2.0-0 rvest_0.3.6 haven_2.3.1