Last updated: 2021-01-13


Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version d3deb84. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  analysis/01_get_check_data.Rmd.txt
    Untracked:  analysis/standardise.txt

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01-5_check_name.Rmd) and HTML (docs/01-5_check_name.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd d3deb84 Ross Gayler 2021-01-13 Add 01-5 check name

# Set up the project environment, because each Rmd file knits in a new R session
# so doesn't get the project setup from .Rprofile

# Project setup
library(here)
source(here::here("code", "setup_project.R"))
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.3     ✓ purrr   0.3.4
✓ tibble  3.0.4     ✓ dplyr   1.0.2
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
# Extra set up for the 01*.Rmd notebooks
source(here::here("code", "setup_01.R"))

Attaching package: 'glue'
The following object is masked from 'package:dplyr':

    collapse
# Extra set up for this notebook
# ???

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

1 Introduction

The 01*.Rmd notebooks read the data, filter it to the subset to be used for modelling, characterise it to understand it, check for possible gotchas, clean it, and save it for the analyses proper.

This notebook (01-5_check_name) characterises the name variables in the saved subset of the data.

These variables will be used to construct the main predictors in the compatibility models.

We intend to use the one snapshot file as both the database to be queried and as the set of queries. Consequently, strictly speaking, we don’t need to standardise the name variables because the database and query records are guaranteed to be identical (they will literally be the same record). However, we will look at the name variables with an eye to standardisation because it is never a good idea to statistically model data without having an idea about the quality of the data. We will apply some basic standardisation to the name variables, if appropriate, because it parallels what would be necessary in practice.


Define the name variables.

vars_name <- c(
  "last_name", "first_name", "midl_name", "name_sufx_cd" 
)

Read the usable data. Remember that this consists of only the ACTIVE & VERIFIED records.

# Show the entity data file location
# This is set in code/file_paths.R
f_entity_fst
[1] "/home/ross/RG/projects/academic/entity_resolution/fa_sim_cal_TOP/fa_sim_cal/output/ent_raw.fst"
# get data for next section of analyses
d <- fst::read_fst(
  f_entity_fst, 
  columns = c(vars_name, "sex") # get sex as well for cross-checking
) %>% 
  tibble::as_tibble()
dim(d)
[1] 4099699       5

Take a quick look at the distributions.

d %>% skimr::skim()
Table 1.1: Data summary
Name Piped data
Number of rows 4099699
Number of columns 5
_______________________
Column type frequency:
character 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
last_name 0 1.00 1 21 0 191996 0
first_name 23 1.00 1 19 0 126589 0
midl_name 252695 0.94 1 20 0 175742 0
name_sufx_cd 3869063 0.06 1 3 0 101 0
sex 0 1.00 3 6 0 3 0
  • last_name 100% filled
  • first_name ~100% filled (23 missing)
  • midl_name 94% filled
  • name_sufx_cd 6% filled

2 Name length

Look at the distributions of name lengths first, before moving on to analyses more focused on standardisation.

Calculate the lengths of the name variables.

x <- d %>% 
  dplyr::mutate(
    len_last = stringr::str_length(last_name),
    len_first = stringr::str_length(first_name),
    len_midl = stringr::str_length(midl_name)
  )

2.1 last_name

last_name Voter last name

Look at the distributions of name lengths.

summary(x$len_last)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   5.000   6.000   6.345   7.000  21.000 
table(x$len_last, useNA = "ifany")

      1       2       3       4       5       6       7       8       9      10 
     18    2046   53580  393363  864542 1094952  805773  514347  212379   96777 
     11      12      13      14      15      16      17      18      19      20 
  33039   12034    6844    4239    2679    1632     824     404     152      73 
     21 
      2 
x %>% 
  ggplot() +
  geom_histogram(aes(x = len_last), binwidth = 1) +
  scale_y_sqrt()

Look at examples of short names.

# length == 1
x %>% 
  dplyr::filter(len_last == 1) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
A CHUH NA
A THEK NA
H MOIH NA
J J NA
K HOA HIEP
K NGEO NA
K NIUH NA
K RICHARD V
K SANG NA
M COY FAY
N RENEE VIVIAN
R ANDREW PERNELL
R MARY NA
S PETER THOMAS
U RAYMOND NA
X MARCUS NA
X WILLIE LARRY
Y PRUM NA
  • 1-letter last names are very rare
  • 1-letter last names are probably errors
# length == 2
x %>% 
  dplyr::filter(len_last == 2) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
AR ORAWAN P
DO HANH THUAN T
EA YOUNG HOW
EO YONG SUK
HA YONG S
KO LINDA KYONGSUK
LE ANDREW CHAU
LE DANH MINH
LE DU D
LE NANCY NICHOLS
LE QUANG TRAN
LU IAN MICHAEL
MA ARNOLD M
MA JAMES SUNG KAO
NG AMY L0CKAMY
VO KHANH HUU
WU KUY M
YI JI SUK NA
YU JUN HYUK
YU XIAO LI
  • Most 2-letter last names are probably valid.
  • ST is probably Saint from a multi-word last name

Look at examples of long names.

# length == 21
x %>% 
  dplyr::filter(len_last == 21) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
ALESSANDRETTI-STRAUSS MARIA E
BREWINGTON-SUTHERLAND LISA A
  • 21-letter last names are hyphenated
# length >= 20
x %>% 
  dplyr::filter(len_last >= 20) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
ANASTASIOU-JOSEPHIDE THEODORA A
ARDESHIRPOUR-ZARTOSH PARVIZ NA
ARRIAGADA-VALENZUELA GONZALO ESTEBAN
BEDINGFIELD-DEMATTEO HOLLIS BEDINGFIELD
BEN MESSAOUDMESSAOUD AHMED BEN
FERRIOLA-BRUCKENSTEI ZACHARY NA
FRANKFORT-WINNINGHAM SUSAN R
GERKHARDT-GODZIEMSKI ALICE ELIZABETH
HUDSON-CHARLES-PIERR MONIQUE NA
KACZMAREK-HUFFSTETLE KIM NA
KLOCZKOWSKI-BERTRAND DAWN M
MCCUTCHEON-GUTKNECHT LISA ANN
MORRISON-WESTMORELAN DAWN IRVING
NOOHLANHLA GUGULETHE ALAMILLA NA
SCHIAPPACASSE-DEPUTY STEPHANIE E
SOTELO DE LOS SANTOS MARCOS ANTONIO
THEODORDES-GRINESTAF APRIL ARLETHA
THEODORIDES-GRINESTA APRIL ARLETHA
VALL-SPINOSA PERKINS JESSIE FAYE
WASHINGTON-HALFKENNY DAVID D
  • 20+-letter last names appear to be multi-word and/or hyphenated

2.2 first_name

first_name Voter first name

Look at the distributions of name lengths.

summary(x$len_first)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   5.000   6.000   5.913   7.000  19.000      23 
table(x$len_first, useNA = "ifany")

      1       2       3       4       5       6       7       8       9      10 
   8070    3799   99236  525505 1077727 1018768  884199  295743  135014   19359 
     11      12      13      14      15      16      17      18      19    <NA> 
  29314    1487     880     345     215       9       4       1       1      23 
x %>% 
  ggplot() +
  geom_histogram(aes(x = len_first), binwidth = 1) +
  scale_y_sqrt()
Warning: Removed 23 rows containing non-finite values (stat_bin).

Look at the missing names.

x %>% 
  dplyr::filter(is.na(first_name)) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
ALEXANDER NA JASON
AMEN NA NA
BULLARD NA ALEXIS
BURGESS NA NA
CHESTER NA JAMES
ELSASS NA NA
FRISBY NA M
FRYE WILLIAM C NA NA
FUQUA NA MARY
FUQUA NA WILLIAM
GRAYWOLF NA NA
JUDITH NA NA
KAUCHICK NA PAULINE
MAGENTA NA NA
MALIK NA NA
MCKEEL NA LESTER
MOLET NA MICHAEL
MORRIS NA ALEXANDER
PATTERSON NA JOHN DEXTER
PHOENIX NA NA
SILVERMOON NA NA
WARREN NA NA
ZIMMER NA CLIFFORD
  • Some missing first names look like the middle name is actually the first name, e.g. ? JASON ALEXANDER
  • Some missing first names appear to have only a last name, e.g. ? ? AMEN
  • Some missing first names appear to have the entire name in the last name variable, e.g. ? ? FRYE WILLIAM C
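
The three patterns above could be flagged mechanically before deciding how to treat them. The following is a minimal sketch on toy rows copied from the table; the column `missing_first_type` and its labels are hypothetical, and the space test is crude because legitimate multi-word last names also contain spaces.

```r
library(dplyr)
library(stringr)

# Toy rows copied from the table above (not the full data set)
toy_missing <- tibble::tibble(
  last_name  = c("ALEXANDER", "AMEN", "FRYE WILLIAM C"),
  first_name = NA_character_,
  midl_name  = c("JASON", NA, NA)
)

# Hypothetical labels for the three patterns of missing first name
flagged <- toy_missing %>%
  dplyr::mutate(
    missing_first_type = dplyr::case_when(
      !is.na(first_name)                  ~ "first name present",
      !is.na(midl_name)                   ~ "middle name may be first name",
      stringr::str_detect(last_name, " ") ~ "whole name may be in last_name",
      TRUE                                ~ "last name only"
    )
  )
```

With only 23 affected records, a manual review of the flagged rows would also be feasible.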

Look at examples of short names.

# length == 1
x %>% 
  dplyr::filter(len_first == 1) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
ANDREWS A E
BARNETTE C V
BENFIELD J D
BOONE A C
BOSTWICK H KATHLEEN
CAMPBELL W THOMAS
HERMAN L E
HOOKER S A
MCDONALL W B MRS
MILLER J H
MILTON E D
OVERMAN R DALE
REIDENBACH W SCOTT
SMITH J C
STRAUB C WINIFRED
TOWNSEND J B
TUTTLE M GERTRUDE
WIGGINS J BELTON
WILLIAMS A BLANDENA
WILLIFORD W T
  • The 1-letter first names appear to be using an initial as the first name
# length == 2
x %>% 
  dplyr::filter(len_first == 2) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
BOONE JO GRAHAM
BROOKS SU TONYA CARYETTE
CALVILLO AJ NA
CLARK JO W
FAIR JD FAIR
FAULHABER JO ANN
FOWLER LA SONDA
HARVEY TY NA
HOFF MI YONG
JUDD JO D
MCKEE JO SHUMATE
MILLER AL NA
MIMS JO CHANDLER
MULLEN JO L
NGUYEN HO NGOC
NICHOLSON DE MELVIN
TANG YU NA
THOMASON JO CARPENTER
WHICHARD AL NA
WON UN T

2-letter first names appear to be:

  • Valid, e.g. JO W CLARK, HO NGOC NGUYEN
  • Part of a multi-word name that has been split across the first and middle name variables, e.g. LA SONDA FOWLER
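
Candidates for the second case could be listed for review with a sketch like the one below. The particle list `particles` and the column `maybe_split` are assumptions made for illustration; they are not part of the project code, and a flagged record is only a candidate, not a confirmed split name.

```r
library(dplyr)
library(stringr)

# Hypothetical list of 2-letter particles that may start a multi-word first name
particles <- c("LA", "DE", "LE")

# Toy rows copied from the table above
toy_short <- tibble::tibble(
  last_name  = c("FOWLER", "CLARK", "NGUYEN", "HARVEY"),
  first_name = c("LA", "JO", "HO", "TY"),
  midl_name  = c("SONDA", "W", "NGOC", NA)
)

# Flag records where a particle is followed by a full (non-initial) middle name
candidates <- toy_short %>%
  dplyr::mutate(
    maybe_split = first_name %in% particles &
      !is.na(midl_name) &
      stringr::str_length(midl_name) > 1
  )
```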

Look at the long names.

# length >= 16
x %>% 
  dplyr::filter(len_first >= 16) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
ANDERSON MICHAEL-CHEROKEE DEMCK
DOUPE KIMBERLY DANIELLE WYATT
ENRIQUEZ MARIA DEL CARMEN NA
FIELDS ADRIENNE`FELICIA NA
LAPPAS-KOTARA MICHELLE-ADRIENNE NA
MIDDLESWORTH ELIZABETH-LINDSAY MCCOY
NAGARAJ SANTHEBACHAHALLI S
NATARAJA HEGGADADEVANAKOTE NA
NGUYEN THI PHUONG KHAUH NA
NUNEZ MARIANA DE JESUS N
ODEMS MICHAEL-CHRISTOPHER NA
PERRY SHIRLEY ANN-PEPPER NA
RODRIGUEZ MARIA DEL CARMAN NA
SUBRAMANIAM LAKSHMINARAYANAN NA
WINKLER ELIZABETH PORTIS G

Long first names appear to be:

  • Long non-Anglo names, e.g. LAKSHMINARAYANAN
  • Multi-word and/or hyphenated, e.g. ELIZABETH-LINDSAY

2.3 midl_name

midl_name Voter middle name

These names will often be missing or initials only.

Look at the distributions of name lengths.

summary(x$len_midl)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    3.00    5.00    4.73    6.00   20.00  252695 
table(x$len_midl, useNA = "ifany")

     1      2      3      4      5      6      7      8      9     10     11 
826716  10491 289439 440549 651587 705383 508158 227267 114306  30604  20536 
    12     13     14     15     16     17     18     19     20   <NA> 
  9807   5186   3514   3379     50     21      8      2      1 252695 
x %>% 
  ggplot() +
  geom_histogram(aes(x = len_midl), binwidth = 1) +
  scale_y_sqrt()
Warning: Removed 252695 rows containing non-finite values (stat_bin).

  • Many records are missing middle name
  • Spike of 1-letter names will be initials

Look at the long names.

# length >= 16
x %>% 
  dplyr::filter(len_midl >= 16) %>% 
  dplyr::select(ends_with("_name")) %>% 
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name
ARTIST SYLVIA JOYCE WIILIAMSON
BILLOW LESLEY ELIZABETH CLAUSS
BOWDEN CORA FRANCES THOMPSON
BRINKHUIS VANESSA INGRID-PRISCILLA
CALL LUNIA ANNTONIA MCCRARY
DELLA MEA CAROLYN ROBINSON
EXUM SHEILA LANENA WHITEHEAD
GANAWAY SUSAN ANN WINTERHALTER
GULLEY JOHN MARCUS DELAFAYETTE
HARRIS ANN PULLER- MARCELINE-ZO
HARTSFIELD NAOMI RUTH SATTERWHITE
HICKS NELLIE BEATRICE-RICHARDSON
HOGGARD ANN DENISE HARRINGTON
MOORE VIDA GWENEVERE BARNER
RIVERA RAFAEL ANTONIO CARAMBOT
ROGERS RUBYE REBECCA/SUDDRETH
SWINSON MARY ELIZABETH FRANCIS
WHITENER STEPHANIE LYNNE WARREN PARKER
WOOD T BENBURY HAUGHTON
YOUNGER ZEE CAMILLE PREVETTE
  • Long middle names appear to be multiple names and/or hyphenated
# clean up
rm(x)
gc()
           used  (Mb) gc trigger  (Mb)  max used  (Mb)
Ncells  1805215  96.5    2918587 155.9   2918587 155.9
Vcells 29961780 228.6   89587521 683.5 105198465 802.7

3 name_sufx_cd

name_sufx_cd Voter name suffix

This is intended for generation markers, e.g. Junior, Senior.

I am not going to use name suffix in entity resolution because age should be sufficient and is much better quality. I will look at what values turn up in the name suffix because the same values sometimes wrongly occur in the main name variables. Knowing what values occur may help us to remove those values from the main name variables.

d %>% dplyr::select(name_sufx_cd) %>% skimr::skim()
Table 3.1: Data summary
Name Piped data
Number of rows 4099699
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name_sufx_cd 3869063 0.06 1 3 0 101 0
table(d$name_sufx_cd, useNA = "ifany") %>% sort() %>% rev()

   <NA>      JR     III      SR      II      IV     JR.     SR.       I       V 
3869063  153804   29605   27494   14043    3682    1060     226     218     190 
    111     MRS      11      VI       `     VII     MR.     MS.       J       E 
     67      50      28      27      13       9       7       5       5       4 
     MR       C       W     SCO       S     REV       R       N       M      JD 
      3       3       2       2       2       2       2       2       2       2 
    DR.       D     ANN       0     (JR       X     WAL     VIR     TOB     Sr. 
      2       2       2       2       2       1       1       1       1       1 
    SMI     SAM     REE     RAY       Q     PLA       P      ON      OD       O 
      1       1       1       1       1       1       1       1       1       1 
     MS     MOO     MMO      MD     MCQ     MAC     LOC     LLL      LL     LEW 
      1       1       1       1       1       1       1       1       1       1 
    LEE     LAR       L     KIT     KEN       K     JR,     JAC     ING     ILI 
      1       1       1       1       1       1       1       1       1       1 
    II.       H     GUY     GLE       G     FOR     FAU     F M      EY     EWA 
      1       1       1       1       1       1       1       1       1       1 
    ELS     DOR      DO     DIC     CUB     CHA       B     ALB     AJR       A 
      1       1       1       1       1       1       1       1       1       1 
    8TH       5     3RD      39     346       2      1V      15     134     070 
      1       1       1       1       1       1       1       1       1       1 
     \\     (II 
      1       1 
# get a better look at the cleaned suffixes
d %>% 
  dplyr::mutate(
    sufx = name_sufx_cd %>% 
      stringr::str_to_upper() %>% 
      stringr::str_remove_all(pattern = "[^A-Z0-9]") %>% # remove non-alphanumeric
      dplyr::na_if("") 
  ) %>% 
  dplyr::count(sufx) %>% 
  dplyr::filter(n > 1) %>% 
  dplyr::arrange(desc(n), sufx) %>% 
  knitr::kable()
sufx n
NA 3869077
JR 154867
III 29605
SR 27721
II 14045
IV 3682
I 218
V 190
111 67
MRS 50
11 28
VI 27
MR 10
VII 9
MS 6
J 5
E 4
C 3
0 2
ANN 2
D 2
DR 2
JD 2
M 2
N 2
R 2
REV 2
S 2
SCO 2
W 2
  • There are generation suffixes: JR, SR, I, II (11), III (111), IV, V, VI, VII
  • There are honorific titles: MRS, MR, MS, DR, REV
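
If the generation suffixes were to be used (for example, to strip the same values out of the main name variables), the variants seen above could be collapsed to a small set. This is a sketch only: the lookup `sufx_fix`, the vector `generation`, and the function `clean_sufx()` are hypothetical names, and treating 11, 111 and 1V as II, III and IV is an assumption based on the counts above.

```r
library(dplyr)
library(stringr)

# Hypothetical lookup: digit look-alikes for Roman numerals
sufx_fix <- c("11" = "II", "111" = "III", "1V" = "IV")

# Generation suffixes listed above
generation <- c("JR", "SR", "I", "II", "III", "IV", "V", "VI", "VII")

clean_sufx <- function(x) {
  # Same cleaning as above: upper-case, drop non-alphanumerics, empty -> NA
  x <- x %>%
    stringr::str_to_upper() %>%
    stringr::str_remove_all(pattern = "[^A-Z0-9]") %>%
    dplyr::na_if("")
  # Map the digit look-alikes to Roman numerals
  is_fix <- x %in% names(sufx_fix)
  x[is_fix] <- unname(sufx_fix[x[is_fix]])
  # Keep only recognised generation suffixes; everything else becomes NA
  dplyr::if_else(x %in% generation, x, NA_character_)
}

clean_sufx(c("JR.", "Sr.", "111", "11", "MRS", "`", NA, "(II"))
```

Honorifics such as MRS are set to missing here rather than kept, because they are not generation markers.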

4 Standardisation

Look at issues that might be addressed by standardisation.

For each type of standardisation issue, look at first, middle, and last names separately, because the issue may manifest differently in each of the name variables.
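
Because the same check is repeated for each name variable, the per-variable counts could also be obtained in one call. The helper `count_matches()` below is a hypothetical convenience function, not part of the project code, shown on toy values of the kind that occur in the data.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: number of records in which each name variable
# matches a regular expression (missing values are not counted)
count_matches <- function(data, pattern,
                          vars = c("last_name", "first_name", "midl_name")) {
  data %>%
    dplyr::summarise(
      dplyr::across(
        dplyr::all_of(vars),
        ~ sum(stringr::str_detect(.x, pattern), na.rm = TRUE)
      )
    )
}

# Toy rows for illustration only
toy_names <- tibble::tibble(
  last_name  = c("JEAN-PIERRE", "O'NEAL", "ST. JOHN"),
  first_name = c("HERBERT", "JO-ANN", NA),
  midl_name  = c(NA, "ANN-MARIE", "J.")
)

hyphen_counts <- count_matches(toy_names, "-")
period_counts <- count_matches(toy_names, "\\.")
```

The counts do not replace looking at sampled records, which is what shows whether the matches are legitimate.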

4.1 Lower-case letters

d %>% dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "[a-z]"))
# A tibble: 3 x 1
  last_name       
  <chr>           
1 MacQUEEN        
2 MacQUEEN        
3 BROWN-McCULLOUGH
d %>% dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "[a-z]"))
# A tibble: 11 x 1
   first_name
   <chr>     
 1 JoANN     
 2 LaVERNE   
 3 JoANNE    
 4 JoANN     
 5 SiROBERT  
 6 McCKINES  
 7 DeNEAL    
 8 McHILDIA  
 9 JoANN     
10 LaSONYA   
11 JeROME    
d %>% dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "[a-z]"))
# A tibble: 76 x 1
   midl_name  
   <chr>      
 1 McBRIDE    
 2 McBRIDE    
 3 McKINNIE   
 4 McLAWHORN  
 5 McKEITHAN  
 6 McCULLEN   
 7 MacFRANKLIN
 8 McQUEEN    
 9 McPHAIL    
10 McCULLEN   
# … with 66 more rows
  • Lower case letters occur in last, first, and middle names
  • Associated with name prefixes (e.g. Mc, Mac, Jo, La, De) that could optionally be followed by a space, e.g. JoANN, McBRIDE
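
One possible standardisation for this issue is to map all the name variables to upper case. The sketch below assumes that approach; `vars_name_main` is a hypothetical name, and the toy rows are assembled from values shown above rather than being real records.

```r
library(dplyr)
library(stringr)

# Hypothetical vector of the main name variables (suffix excluded)
vars_name_main <- c("last_name", "first_name", "midl_name")

# Toy rows assembled from values shown above
toy_case <- tibble::tibble(
  last_name  = c("MacQUEEN", "BROWN-McCULLOUGH"),
  first_name = c("JoANN", "LaVERNE"),
  midl_name  = c("McBRIDE", NA)
)

# Convert every main name variable to upper case; NA stays NA
toy_upper <- toy_case %>%
  dplyr::mutate(
    dplyr::across(dplyr::all_of(vars_name_main), stringr::str_to_upper)
  )
```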

4.2 Non-alphanumeric

Check for non-alphanumeric characters in names.

4.2.1 Hyphen

Check for hyphens.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, "-"))

nrow(x)
[1] 20543
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
BAZAN-MANSON ANDREA NA NA FEMALE
BENITEZ-GRAHAM ANA NA NA FEMALE
BERRY-DANIEL VAUGHN NA NA FEMALE
BIBB-FREEMAN TIFFANY OCTAVIA NA FEMALE
COALE-KRUPA MARY KITTY NA FEMALE
EVERETT-GIGLIO SUZANNE MARIE NA FEMALE
HARRISON-LAMPTEY JAMES CHARLES NA MALE
JEAN-PIERRE HERBERT NA NA MALE
JENKINS-JAMES TREVA NA NA FEMALE
JOHNSON-DILLENBECK LINDA JEAN NA FEMALE
JOHNSON-FORBES SHAQUILLA NICOLE JOHNSON NA FEMALE
MACK-PURNELL JOYCE A NA FEMALE
MANNING-SHAUB CHERYL NA NA FEMALE
POLLARD-GREIF RHIANNON ELIZABETH NA FEMALE
RADFORD-BLACK ANITA NA NA FEMALE
RICHMOND-GRAVES VANESSA D NA FEMALE
SHERRELL-PATTERSON KRYSTAL ANN NA FEMALE
STONE-TANENBERG KAREN ANNE NA FEMALE
TILLEY-VARELA MYRA AMANDA NA FEMALE
WIMBISH-VANDERBECK LAURA NA NA FEMALE
  • ~21k last names with hyphens
  • Look like legitimately hyphenated last names
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, "-"))

nrow(x)
[1] 3011
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(first_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
ROSAVAGE ANN-MARIE NA NA FEMALE
CRENSHAW CALLIE-ANNE DOANE NA FEMALE
GLENN CHIH-TZU L NA FEMALE
WOODARD ESTHER-JOAN SURRETT NA FEMALE
ROUSSEAUX JEAN-CLAUDE CHRISTIAN NA MALE
BERARD JEAN-PAUL NA NA MALE
ARTIS JO-ANN NA NA FEMALE
FURLONG JO-ANN ALICE NA FEMALE
DALE JON-MARC RYAN NA MALE
COOK KAWIKA-JAMAL SAMUEL NA MALE
MILLER LEE-JAMIL K NA MALE
BEAVER RUTH-ANNE GUST NA FEMALE
YEUNG SHIN-YIING NA NA FEMALE
JAN SHYI-TAI NA NA MALE
CHU TE-HSIN A NA FEMALE
SU TSUNG-HU NA NA FEMALE
MOJICA WILLIAM-JOSEPH KAILI NA MALE
TSAI WON-WHEI NA NA FEMALE
TAPP YOUNG-SUK O NA FEMALE
CHANG YU-JHI(JULIE) CHEN NA FEMALE
  • ~3k first names with hyphens
  • Look like legitimately hyphenated first names
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "-"))

nrow(x)
[1] 3883
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
CABRIELE DEBRA ANN-MARIE NA FEMALE
CONRAD HEATHER CLEONA-JANE NA FEMALE
REEVES ANDREW DAVID-JOEL NA MALE
BROOKS DARRICK E-ALSTON NA MALE
EBRAHIM EMAD ELDIN-YASHAR NA MALE
WILSON KAY FRANCES-LAWS NA FEMALE
LANE MICHELLE GAYE-PRESTON NA FEMALE
YUAN DEREK HAW-LUEN NA MALE
PATE BARBARA JEAN-DALE NA FEMALE
CHAN GODWIN KWOK-YIN NA MALE
HUGHES RACHEL LYNN-INGRAM NA FEMALE
MORTON QUIANA MAISHA-ANN NA FEMALE
SAYE ROBYN MOO-YOUNG NA FEMALE
DIXON STANLEY RAY-HAMILTON NA MALE
PARKER BRANDON SHON-DAY NA MALE
CYRAN JACLYN SUZANNE-MARIE NA FEMALE
BOWMAN HELEN TAUSSIG-HAUPT NA FEMALE
BELL ROSE TIEH-CHIN NA FEMALE
COLDREN RUTH VIOLA-SHEATS NA FEMALE
KIM LEGIA YOUNG-SON NA FEMALE
  • ~4k middle names with hyphens
  • Look like legitimately hyphenated middle names

4.2.2 Quote

Check for quotes.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, "'"))

nrow(x)
[1] 4920
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
D'ARBEAU STEPHEN B NA MALE
I'ANSON-JACKSON JENNIFER NA NA FEMALE
O'BRIANT MABEL S NA FEMALE
O'BRIEN MARK S NA MALE
O'BRIEN PATRICK WAYNE SR MALE
O'BRIEN WILLIAM PATRICK NA MALE
O'CONNELL TINA DEE NA FEMALE
O'MALLEY-HELMS COLLEEN E NA FEMALE
O'MEARA MORGAN STUART NA MALE
O'NEAL DORIS TAYLOR NA FEMALE
O'NEAL KELLY NA NA FEMALE
O'NEAL BETTY MAGALENE D NA FEMALE
O'NEAL TAKINA L NA FEMALE
O'NEAL LINDA F NA FEMALE
O'NEAL CHARLES FRANKLIN NA MALE
O'NEIL DONNA LOUISE NA FEMALE
O'NEILL LUCILLE W NA FEMALE
O'QUINN VICKIE LEE NA FEMALE
O'ROURKE JOHN F NA MALE
O'ROURKE JEFFERY JAMES NA MALE
  • ~5k last names with quotes
  • Look like legitimately quoted last names
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, "'"))

nrow(x)
[1] 1226
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(first_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
KNIGHT A'NDREA LANIER NA FEMALE
RICHARDSON ANDRE' STEVEN NA MALE
CORBETT D'ANDREA PERE NA FEMALE
PASOUR D'ETTA TAYLOR NA FEMALE
JONES DEONTAE' QUINN NA MALE
EDWARDS DESIRE' DENISE NA FEMALE
WILKINS FAR'D HAKEEM NA MALE
RICHARDSON J'MAINE NMN NA MALE
ALSTON J'MIA KAE NA FEMALE
DUNLAP JA'TINA R NA FEMALE
SUITT L'TONYA NA NA FEMALE
LITTLEJOHN LA'KANYA MICHELLE NA FEMALE
HALL LA'KETTA CHENTAL NA FEMALE
DOWELL LA'TONYA YVETTE NA FEMALE
FORD O'DENA NA NA FEMALE
JACKSON O'NEIL NA NA MALE
HILBURN O'NEILL NA NA MALE
LEBEAU RENE' DOMITIEN NA MALE
WEATHERBEE SADE' SHANNON NA FEMALE
MITCHELL SHANTAE' T NA FEMALE
  • ~1k first names with quotes
  • Look like legitimately quoted first names
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "'"))

nrow(x)
[1] 3152
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
GRILLO LUIS CHE' NA MALE
CRAWFORD NICKOLAS D'ANDRE NA MALE
LARSEN HEATHER D'ANN NA FEMALE
SINGHATEH NICHELLE DY'VONNE NA FEMALE
HARPER DUNSEY LA'TAZE NA MALE
BATTLE IKEDA LE'RECIA NA FEMALE
O'CONNELL KAREN O'BRIEN NA FEMALE
SPERRY ANN O'BRIEN NA FEMALE
GINYARD DEEDRICK O'BRIEN NA MALE
ARNEY KATHLEEN O'DWYER NA FEMALE
DOWNES ANN O'HARA NA FEMALE
VANHOOK BRANDON O'NEAL NA MALE
JONES ROBIN O'NEIL NA FEMALE
JOHNSON DAESHAWAN O'NEIL NA MALE
MANNS RUSSELL O'NEIL JR MALE
CALLOWAY TOMIKA RENEE' NA FEMALE
THOMAS DENA RENEE' NA FEMALE
CALLAHAN PAMELA RENEE' NA FEMALE
GRAHAM QUINDERIA SH'RON NA FEMALE
JAMES BRITTANY VONT'E NA FEMALE
  • ~3k middle names with quotes
  • Look like legitimately quoted middle names

4.2.3 Period

Check for periods.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, "\\."))

nrow(x)
[1] 11
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
BINGHAM JR. AMES EDMOND NA MALE
DAYE JR. JAMES NA JR MALE
RUSSELL, JR. KERMITT PATRICK NA MALE
ST. CLAIR JACK LEE NA MALE
ST. CYR CANDICE NICOLE NA FEMALE
ST. GEORGE MARTHA S NA FEMALE
ST. GERMAIN AMY NA NA FEMALE
ST. JOHN JESSICA JO NA FEMALE
ST. LAWRENCE ELIZABETH W NA FEMALE
ST.CLAIRE KEVIN WAYNE NA MALE
ST.JOHN JOANN DIMAGGIO NA FEMALE
  • 11 last names with periods
  • Look like legitimate abbreviations: ST. (Saint) prefixes, plus JR. suffixes that belong in the suffix field
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, "\\."))

nrow(x)
[1] 120
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(first_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
NORRIS A.T. NA NA MALE
EVINS BETTY L. CURRIN NA FEMALE
BUIE BEVERLY D. COOKE NA FEMALE
DUNN C. SHAY HARRELSON NA FEMALE
SWOFFORD D. MILYNN NA FEMALE
ROSS E. TRAVIS JR MALE
INGOLFSSON E. JUANITA O'BRIEN NA FEMALE
LILLEY G. C. NA MALE
NAVARRE J. RICHARD II MALE
JARRETT J. REID NA MALE
AINSLEY J. (JULIUS) T.(THOMAS) NA MALE
RENDLEMAN J.T. NA NA MALE
GIBSON M. COLINE NA FEMALE
HICKS MARY E. PALMER NA FEMALE
UNDERWOOD NORMA J. PHILLIPS NA FEMALE
GARSKA P.J. JAN DE BEWR NA FEMALE
USSERY PRISCILLA B. SANDERS NA FEMALE
BOUCHER T. RENEE NA NA FEMALE
GABLE THOMAS J. WESTLEY NA MALE
MUSSON W. JAMES NA MALE
  • 120 first names with periods
  • Look like initials
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "\\."))

nrow(x)
[1] 2233
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
KYKER JACOB C. NA MALE
TOWNSEND IRIS D. MELENDEZ NA FEMALE
FLOYD WALTER E. NA MALE
WARMACK WILLIAM G. NA MALE
MASON PEGGY H. NA FEMALE
ENOCH WILLIAM H. NA MALE
MITCHELL JOHN J. NA MALE
HAMILTON MILON J. NA MALE
PARROTT ULYSSES J. JR MALE
NEWBLE ANDRE L.K. NA MALE
LENAHAN LOIS M. NA FEMALE
JONES TANETTA M. NA FEMALE
PHILLIPS HELEN M. NA FEMALE
MOORE DOROTHY M. NA FEMALE
EDWARDS STANLEY M. NA MALE
BROOKS MARILYN M.LEDFORD NA FEMALE
WAVERLY TRACY R. NA FEMALE
SCOTT HENRY R. NA MALE
GRAHAM KENDRA T. NA FEMALE
ESPERGREN MARY T. NA FEMALE
  • ~2k middle names with periods
  • Look like initials
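
If periods were to be standardised away, one option is to replace each period with a space and then collapse repeated spaces, so that ST.CLAIRE and ST. CLAIRE end up identical. This is a sketch under that assumption; `strip_periods()` is a hypothetical helper, and deleting the periods outright would be an equally defensible choice.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: turn periods into spaces, then tidy the whitespace
strip_periods <- function(x) {
  x %>%
    stringr::str_replace_all("\\.", " ") %>%
    stringr::str_squish()
}

strip_periods(c("ST. JOHN", "ST.CLAIRE", "A.T.", "M.", NA))
```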

4.2.4 Comma

Check for commas.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, ","))

nrow(x)
[1] 2
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
FILLINGHAM, II ROBERT E NA MALE
RUSSELL, JR. KERMITT PATRICK NA MALE
  • 2 last names with commas
  • Punctuation for suffix field values added to last name
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, ","))

nrow(x)
[1] 4
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(first_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
PHILLIPS FRANK, NA JR MALE
HICKS MARION, NA SR MALE
CANIPE NOAH, NA JR MALE
MCADAMS WILL,JR NA NA MALE
  • 4 first names with commas
  • Arbitrary added punctuation
  • Punctuation for suffix field value added to first name
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, ","))

nrow(x)
[1] 12
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex
FAUCETTE JESSE EDWARD, J NA MALE
BRASWELL ROBERT ELLIS, J NA MALE
MARTIN LLOYD FRANKLIN, S NA MALE
GAY ROBERT HENRY, III. NA MALE
FERGUSON STANTON HYDE, J NA MALE
CLARK COLEMAN JACKSON, I NA MALE
BARNES RUSSELL JOSEPH, J NA MALE
PIERCE RUTH P, NA FEMALE
COVINGTON EDNA(MRS PERRY, JR) NA FEMALE
SCARBOROUGH JOHN R, NA MALE
SHEARIN ANDREW THOMAS, S NA MALE
WILLIAMS ERVIN W., SR., NA MALE
  • 12 middle names with commas
  • List separator
  • Punctuation to squeeze in extra field
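
Since most commas introduce a suffix that has been squeezed into a name field, a name could be split at its first comma to separate the two parts. The helper `split_at_comma()` and its output columns `name` and `tail` are hypothetical; the sketch does not repair truncated tails such as the single letter J.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: text before the first comma is kept as the name,
# text after it is reduced to alphanumerics as a possible suffix
split_at_comma <- function(x) {
  tibble::tibble(
    name = x %>%
      stringr::str_remove(",.*$") %>%
      stringr::str_squish(),
    tail = x %>%
      stringr::str_extract(",.*$") %>%
      stringr::str_remove_all("[^A-Za-z0-9]") %>%
      dplyr::na_if("")
  )
}

comma_parts <- split_at_comma(
  c("RUSSELL, JR.", "FILLINGHAM, II", "FRANK,", "WILL,JR", "SMITH")
)
```

A name without a comma passes through unchanged with a missing `tail`.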

4.2.5 Other non-alphanumeric

Check for other non-alphanumeric characters.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, "[^ a-zA-Z0-9\\.,'-]"))

nrow(x)
[1] 31
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) # %>% 
# A tibble: 20 x 5
   last_name           first_name midl_name name_sufx_cd sex   
   <chr>               <chr>      <chr>     <chr>        <chr> 
 1 "BOYLES`"           LINDA      BROWN     <NA>         FEMALE
 2 "BRYANT`"           WILLIAM    STEWART   <NA>         MALE  
 3 "COLLINS/SISK"      RHONDA     L         <NA>         FEMALE
 4 "D*AMICO"           PATRICIA   MARIE     <NA>         FEMALE
 5 "D*AMICO"           MEGAN      MARIE     <NA>         FEMALE
 6 "GALINSKY/MALAGUTI" DANA       ANNE      <NA>         FEMALE
 7 "GOSHEN\\"          DIXIE      M         <NA>         FEMALE
 8 "LA\"BEE"           DELACRUZ   <NA>      <NA>         FEMALE
 9 "MARTIN/HUFF"       ELLEN      MARIE     <NA>         FEMALE
10 "MORRISON`"         HAZEL      M         <NA>         FEMALE
11 "NICHOLS/BROWN"     MARY       SUE       <NA>         FEMALE
12 "O*BRIEN"           COLIN      JAMES     <NA>         MALE  
13 "O*TOOLE"           PETER      TERRENCE  <NA>         MALE  
14 "REAVIS/LONG"       SHAWN      MICHELLE  <NA>         FEMALE
15 "RHONEY/PETERS"     DONNA      <NA>      <NA>         FEMALE
16 "SCHERM%MARTIN"     WYATT      <NA>      <NA>         FEMALE
17 "STRTHEIT\\"        LOLA       C         <NA>         FEMALE
18 "TALBERT/GRAHAM"    BRENDA     <NA>      <NA>         FEMALE
19 "TUCKER/JACKSON"    LAVONDA    LYNN      <NA>         FEMALE
20 "WOODARD`"          JASON      WARREN    <NA>         MALE  
  # knitr::kable() # some of the characters break the kable formatting
  • 31 last names with other non-alphanumeric characters
  • Most look like substitutions for hyphen or quote
  • Some look like random cruft
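
The substitutions and cruft described above could be handled with a few ordered replacements. The sketch below assumes that a slash stands for a hyphen, that an asterisk, backtick, or double quote between two letters stands for an apostrophe, and that any other disallowed character can simply be dropped; `fix_other_punct()` is a hypothetical helper and these mappings are guesses from the sampled records.

```r
library(dplyr)
library(stringr)

# Hypothetical helper implementing the assumed mappings
fix_other_punct <- function(x) {
  x %>%
    # slash -> hyphen
    stringr::str_replace_all("/", "-") %>%
    # asterisk, backtick, or double quote between letters -> apostrophe
    stringr::str_replace_all("(?<=[A-Za-z])[*`\"](?=[A-Za-z])", "'") %>%
    # drop anything still outside the character set used in the check above
    stringr::str_remove_all("[^ a-zA-Z0-9\\.,'-]") %>%
    stringr::str_squish()
}

fix_other_punct(c("D*AMICO", "BOYLES`", "COLLINS/SISK", "GOSHEN\\", "LA\"BEE"))
```

Dropping unknown characters joins the surrounding letters, so a value such as SCHERM%MARTIN would need a different rule if the percent sign is meant as a separator.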
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, "[^ a-zA-Z0-9\\.,'-]"))

nrow(x)
[1] 102
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(first_name, sex) # %>% 
# A tibble: 20 x 5
   last_name first_name         midl_name name_sufx_cd sex   
   <chr>     <chr>              <chr>     <chr>        <chr> 
 1 POTEAT    "(KAY)"            ANNE CATH <NA>         FEMALE
 2 FIELDS    "ADRIENNE`FELICIA" <NA>      <NA>         FEMALE
 3 STEELE    "AR`KISHA"         FERNESE   <NA>         FEMALE
 4 JACKSON   "AR`MONIE"         <NA>      <NA>         FEMALE
 5 STUBBS    "BRITNE`"          ELIZABETH <NA>         FEMALE
 6 CLARK     "CANDERE`"         L         <NA>         FEMALE
 7 SELF      "CATHERINE`"       MARIE     <NA>         FEMALE
 8 FABIAN    "D`ARLINE"         D         <NA>         FEMALE
 9 INGRAM    "D`WON"            LAMONTE   <NA>         MALE  
10 NICHOLS   "DORIS ( MRS W"    <NA>      <NA>         FEMALE
11 STEWART   "JA`VONDA"         NICHOLE   <NA>         FEMALE
12 SPENCER   "JAMES (JIM)"      N         <NA>         MALE  
13 CARTER    "JOSE`"            PIERRE    <NA>         MALE  
14 HEMPHILL  "LA`CHERICA"       EVON      <NA>         FEMALE
15 CHESTNUT  "LA`WANDA"         F         <NA>         FEMALE
16 DUNN      "MARY (\"PETE\")"  BURNETTE  <NA>         FEMALE
17 JERNOVICS "MARY SUSAN/"      R         <NA>         FEMALE
18 KERN      "O  (BUDDY)"       R         <NA>         MALE  
19 FOSTER    "OTIS(NMN)"        JR        <NA>         MALE  
20 DAVIDOV   "ZVIYA`CRYSTAL"    <NA>      <NA>         FEMALE
  # knitr::kable() # some of the characters break the kable formatting
  • 102 first names with other non-alphanumeric characters
  • Some look like substitutions for hyphen or quote
  • Some are parenthetical notes
  • Some look like random cruft
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "[^ a-zA-Z0-9\\.,'-]"))

nrow(x)
[1] 1097
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name    first_name  midl_name     name_sufx_cd  sex
KEENE        JOSEPHINE   (MRS OTIS)    NA            FEMALE
PIERMARINI   JACLYN      (NMN)         NA            FEMALE
TORAIN       ROOSEVELT   (NMN)         JR            MALE
WILEY        ROOSEVELT   (NMN)         JR            MALE
HARVEY       NATHANIEL   (NMN)         NA            MALE
EULISS       MAX         (NMN)         NA            MALE
ASHE         WILLARD     (NMN)         NA            MALE
CAMERON      CHARLES     (NMN)         JR            MALE
CRUTCHFIELD  WAYNE       (NMN)         NA            MALE
ROBERTS      ROBIN       (NMN)         NA            MALE
COHEN        SETH        (NMN)         NA            MALE
DEMAS        DOLORIS     A/GEARHART    NA            MALE
MUSSELWHITE  ALLISON     ELAINE/HUMPH  NA            FEMALE
BASS         H           J (HUBERT)    NA            MALE
BRITT        ANGELA      KAY / ROGERS  NA            FEMALE
LOCKLEAR     MINNIE      LEE/JONES     NA            FEMALE
DREW         VIRGINIA    M/KELLEY      NA            FEMALE
BROOKS       WILLIAM     MACK (BILL)   NA            MALE
FRIEDRICH    DELLA       MAE /KEYS     NA            FEMALE
ALSTON       MEI         WAN/DAN       NA            FEMALE
  • ~1k middle names with other non-alphanumeric characters
  • Some look like substitutions for hyphen
  • Many are parenthetical notes (NMN = no middle name)
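
The substitutions noted above could be undone with a few regular-expression replacements. The following is a minimal sketch rather than part of the original analysis: `clean_punct()` is a hypothetical helper, and the mappings it applies (backtick or asterisk to apostrophe, slash to hyphen, parenthetical notes dropped) are assumptions drawn from eyeballing the samples. Characters such as `%` and `"` are left untouched.

```r
# Hypothetical helper: candidate repairs for stray punctuation in name fields.
# Every substitution below is an assumption about the intended value.
clean_punct <- function(x) {
  x <- stringr::str_remove_all(x, "\\([^)]*\\)?")     # drop parenthetical notes, closed or not
  x <- stringr::str_replace_all(x, "[`*]", "'")        # backtick or asterisk -> apostrophe
  x <- stringr::str_replace_all(x, "\\s*/\\s*", "-")   # slash (with optional spaces) -> hyphen
  x <- stringr::str_remove(x, "[-'\\\\]+$")            # drop trailing hyphen, quote or backslash
  stringr::str_squish(x)
}

clean_punct(c("D*AMICO", "COLLINS/SISK", "JAMES (JIM)", "MARY SUSAN/"))
# [1] "D'AMICO"      "COLLINS-SISK" "JAMES"        "MARY SUSAN"
```

A field that consists only of a note, such as "(KAY)", becomes an empty string under this sketch and would need separate handling.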

4.3 Digits

Check for digits.

4.3.1 Zero

Check for zero.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, "0"))

nrow(x)
[1] 29
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) %>% 
  knitr::kable()
last_name        first_name   midl_name  name_sufx_cd  sex
ALEM0N           NOE          A          NA            MALE
BOLAD0           PAULA        HUTCHENS   NA            FEMALE
CAPUT0           BARBARA      DAVIS      NA            FEMALE
CONR0Y           WILLIAM      COURTNEY   NA            MALE
D0WNS            MARIO        ENRICO     NA            MALE
EAT0N            VICKIE       TUGGLE     NA            FEMALE
ESC0BEDO         AUDREY       ANN        NA            FEMALE
FERNANDEZ-BRAV0  GIOVANNI     NA         NA            MALE
GUARDAD0         MANUEL       FELIX      NA            MALE
J0HNSON          LUCILLE      FRANCES    NA            FEMALE
JOHNS0N          MICHAEL      NA         NA            MALE
MCD0UGAL         BETTY        JEAN       NA            FEMALE
OCONN0R          GERALDINE    LOUISE     NA            FEMALE
PEREZ-NAVARR0    CAROLE       SHAY       NA            FEMALE
R0CCO            CHRISTOPHER  NA         NA            MALE
REYN0LDS         ADAM         DANIEL     NA            MALE
RUSS0            ANGEL        MARIE      NA            FEMALE
SIMPS0N          MARY         ANN        NA            FEMALE
WIT0SKY          MICHAEL      ADAM       NA            MALE
YATSK0           JEANETTE     MARIE      NA            FEMALE
  • 29 last names with zero
  • Substitution for O
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, "0"))

nrow(x)
[1] 33
x %>%   
  dplyr::slice_sample(n = 20) %>%
  dplyr::arrange(first_name, sex) %>% 
  knitr::kable()
last_name   first_name  midl_name  name_sufx_cd  sex
PETTY       ALLIS0N     JEAN       NA            FEMALE
BROWN       C0LBY       TODD       NA            MALE
COOPER      C0RDELIA    P          NA            FEMALE
WHITTEMORE  D0LORES     H          NA            FEMALE
LOWE        D0NNA       G          NA            FEMALE
BRANDT      J0HN        C          NA            MALE
CRANFILL    J0HN        NA         NA            MALE
ADAMS       J0HN        WILLIAMS   NA            MALE
TANNAHILL   J0SEPH      ERIC       NA            MALE
WILLIAMS    M0NIKA      UDANA      NA            FEMALE
KEENAN      MARY-J0     NA         NA            FEMALE
SHEPHERD    OTH0        L          NA            MALE
THOMAS      P0LLY       BROWN      NA            FEMALE
RAMIREZ     REYNALD0    G          NA            MALE
BUIE        S0NTE       Y          NA            FEMALE
MITCHELL    SHANN0N     ARLINE     NA            FEMALE
JOHNSON     T0NYA       BETH       NA            FEMALE
MONK        T0NYA       SIVLEY     NA            FEMALE
RUFFIN      TIM0THY     ONEILL     NA            MALE
KENNEDY     V0NCIEAL    LEE        NA            FEMALE
  • 33 first names with zero
  • Substitution for O
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "0"))

nrow(x)
[1] 77
x %>%   
  dplyr::slice_sample(n = 20) %>%
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name   first_name  midl_name        name_sufx_cd  sex
PONGPAIROJ  AMANDA      0                NA            FEMALE
IVESTER     WILLIAM     0DELL            NA            MALE
MOORE       EVA         0MAE             NA            FEMALE
LANE        KATHLEEN    0VERTON          NA            FEMALE
SODAGAR     EASA        2205             NA            MALE
FRENCH      SHNETTA     ALEXANDER080572  NA            FEMALE
NEWSOME     MARK        ANTH0NY          NA            MALE
BRODIE      WILLIAM     C1010            NA            MALE
SMITH       BRODY       CO0PER           NA            MALE
OROPEZA     AMILCAR     COL0N            NA            MALE
LUCK        GENA        DON0HOO          NA            FEMALE
STOLLBRINK  KATHY       J0               NA            FEMALE
NAYLOR      ANGELA      LY0NS            NA            FEMALE
MCKOY       LILLY       M00RE            NA            FEMALE
JONES       RASHAWN     M0NIQUE          NA            FEMALE
MARSHALL    MONICA      NICH0LE          NA            FEMALE
DAVIS       LEANN       RUNY0N           NA            FEMALE
HOLTON      RANDY       SC0TT            NA            MALE
NATION      LAVONIA     V0SS             NA            FEMALE
HASH        MYRTLE      Y0UNG            NA            FEMALE
  • 77 middle names with zero
  • Some are substitution for O
  • Some are part of superfluous numbers
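
If zero is a mistyped letter O, one possible repair is to swap it back only where it touches a letter and the field holds no other digits. This is a minimal sketch under that assumption, not something the original analysis does; `fix_zero()` is a hypothetical helper, and it assumes upper-case input as in this data set.

```r
# Hypothetical helper: read 0 as the letter O when it is adjacent to a letter
# and the field contains no digits other than 0 (so numbers like 2205 or
# ALEXANDER080572 are left alone).
fix_zero <- function(x) {
  only_zero <- !is.na(x) & !stringr::str_detect(x, "[1-9]")
  x[only_zero] <- stringr::str_replace_all(
    x[only_zero],
    "(?<=[A-Z])0|0(?=[A-Z])",  # a zero preceded or followed by a letter
    "O"
  )
  x
}

fix_zero(c("J0HNSON", "M00RE", "MARY-J0", "2205", "ALEXANDER080572"))
# [1] "JOHNSON"         "MOORE"           "MARY-JO"         "2205"            "ALEXANDER080572"
```

A middle name that is just "0" is left unchanged, since it could equally be a stray digit.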

4.3.2 One

Check for one.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, "1"))

nrow(x)
[1] 1
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) %>% 
  knitr::kable()
last_name        first_name  midl_name  name_sufx_cd  sex
SATTERFIELD 111  CHARLES     MASON      NA            MALE
  • 1 last name with one
  • Substitution for I in generation suffix (111 = III)
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, "1"))

nrow(x)
[1] 0
  • 0 first names with one
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "1"))

nrow(x)
[1] 39
x %>%   
  dplyr::slice_sample(n = 20) %>%
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name     first_name  midl_name        name_sufx_cd  sex
PATTERSON     CARL        11               NA            MALE
ADAMS         RALPH       11               NA            MALE
REED          CHARLES     11               NA            MALE
WILLIAMS      JOSEPH      11               NA            MALE
BEST          KENNETH     111              NA            MALE
LOPEZ         CARLOS      111              NA            MALE
QUERY         FRED        111              NA            MALE
FEATHERSTONE  GEORGE      111              NA            MALE
FREEZE        HOMER       111              NA            MALE
KLUTTZ        JOE         111              NA            MALE
MCGOVERN      WILLIAM     111              NA            MALE
WINECOFF      DAVID       111              NA            MALE
JOHNSON       ULUS        111              NA            MALE
COOKE         GEORGE      111              NA            MALE
HOWERIN       MICHAEL     DALE401          NA            MALE
HUNTER        MORDECAI    J1-TO            NA            MALE
FAICLOTH      TIMOTHY     LOUIS7100        NA            MALE
BREEN         TERRANCE    MICHAEL146       NA            MALE
PATTERSON     CARLA       NADINE DOUGLAS1  NA            FEMALE
PLESS         JOAN        WRIGHT2106       NA            FEMALE
  • 39 middle names with one
  • Some are substitution for I in generation suffix
  • Some are part of superfluous numbers
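
A standalone 11 or 111 could be mapped back to a Roman numeral. The following sketch assumes that reading is correct; `fix_gen_ones()` is a hypothetical helper and is not part of the original analysis. Digits embedded in other text (DALE401, J1-TO) are deliberately left alone.

```r
# Hypothetical helper: a standalone "11" or "111" is read as "II" or "III".
# The word boundaries (\\b) stop it from touching digits inside longer tokens.
fix_gen_ones <- function(x) {
  stringr::str_replace_all(x, c("\\b111\\b" = "III", "\\b11\\b" = "II"))
}

fix_gen_ones(c("111", "11", "SATTERFIELD 111", "DALE401", "J1-TO"))
# [1] "III"             "II"              "SATTERFIELD III" "DALE401"         "J1-TO"
```

The repaired value is still a generation suffix sitting in a name field, so it would also need to be moved to `name_sufx_cd` in a later step.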

4.3.3 Other digits

Check for other digits.

x <- d %>% 
  dplyr::filter(stringr::str_detect(last_name, "[2-9]"))

nrow(x)
[1] 1
x %>%   
  dplyr::slice_sample(n = 20) %>% 
  dplyr::arrange(last_name, sex) %>% 
  knitr::kable()
last_name   first_name  midl_name  name_sufx_cd  sex
ALBER5TSON  BASIL       ERVIN      NA            MALE
  • 1 last name with a 5
  • Substitution of 5 for S
x <- d %>% 
  dplyr::filter(stringr::str_detect(first_name, "[2-9]"))

nrow(x)
[1] 2
x %>%   
  dplyr::slice_sample(n = 20) %>%
  dplyr::arrange(first_name, sex) %>% 
  knitr::kable()
last_name  first_name  midl_name  name_sufx_cd  sex
SPIVEY     FR4ANK      THOMAS     SR            MALE
CHILTON    J8IMMIE     HERBERT    NA            MALE
  • 2 first names with digits 2-9
  • Look like random insertions
x <- d %>% 
  dplyr::filter(stringr::str_detect(midl_name, "[2-9]"))

nrow(x)
[1] 24
x %>%   
  dplyr::slice_sample(n = 20) %>%
  dplyr::arrange(midl_name, sex) %>% 
  knitr::kable()
last_name  first_name  midl_name        name_sufx_cd  sex
SODAGAR    EASA        2205             NA            MALE
YOUNG      WANWYNE     4625             NA            FEMALE
CLARKE     MINERVA     4932             NA            FEMALE
PHAIR      IDELL       8017             NA            FEMALE
FRENCH     SHNETTA     ALEXANDER080572  NA            FEMALE
BEACHAM    HEATHER     ANDERSON9104576  NA            FEMALE
SHUMAKER   RUTH        ANN BURTON47     NA            FEMALE
KOERNER    JENNIFER    ANN155           NA            FEMALE
WARD       EVA         B2957            NA            FEMALE
HOWERIN    MICHAEL     DALE401          NA            MALE
GLOVER     DIONNE      LYNN1820         NA            FEMALE
GUIDO      DEANA       LYNN2513         NA            FEMALE
BECHTEL    TERESA      MARIE103062      NA            FEMALE
BREEN      TERRANCE    MICHAEL146       NA            MALE
HILL       ZEB         MITCHELL368      NA            MALE
BLAIR      ESSIE       MIZELLE25248249  NA            FEMALE
PERKINS    TERESA      ROSENBAUM3305    NA            FEMALE
TOOMES     BRIAN       SCOTT3450        NA            MALE
PYRTLE     PHILLIP     W5RAY            SR            MALE
PLESS      JOAN        WRIGHT2106       NA            FEMALE
  • 24 middle names with digits 2-9
  • One random insertion
  • Most appear to be superfluous numbers (from the address?)
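
If the trailing numbers really are superfluous, they could be stripped with a single pattern. This is a sketch under that assumption; `strip_trailing_digits()` is a hypothetical helper, not part of the original analysis. It would also erase the standalone 11 and 111 values found in section 4.3.2, so any repair of those would have to run first.

```r
# Hypothetical helper: drop a trailing run of two or more digits and turn a
# field that is empty afterwards into a missing value.
strip_trailing_digits <- function(x) {
  x <- stringr::str_remove(x, "[0-9]{2,}$")
  x <- stringr::str_squish(x)
  dplyr::na_if(x, "")
}

strip_trailing_digits(c("DALE401", "2205", "ANN BURTON47", "W5RAY"))
# [1] "DALE"       NA           "ANN BURTON" "W5RAY"
```

Digits inserted inside a name (W5RAY, FR4ANK) are not touched, because there is no way to tell from the pattern alone whether the digit replaced a letter or was added.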

4.4 Special words

Look for special words that shouldn’t be in names.

Define word patterns to search for.

# honorifics
w_hons <- c(
  "MR", "MISTER", "MASTER", "MRS", "MS", "MISS", 
  "REV", "REVEREND", "SR", "SISTER", "BR", "BROTHER",
  "FATHER", "MOTHER", "PASTOR", "ELDER", "BISHOP",
  "DR", "DOCTOR", "MD", "PROF", "PROFESSOR"
)

# generation suffixes
w_gen <- c(
  "JR", "JNR", "JUNIOR", "SR", "SNR", "SENIOR",
  "1ST", "2ND", "3RD", "4TH", "5TH", "6TH", "7TH", "8TH",
  "FIRST", "SECOND", "THIRD", "FOURTH", "FIFTH", "SIXTH", "SEVENTH", "EIGHTH", "EIGHTTH",
  "1", "2", "3", "4", "5", "6", "7", "8",
  "I", "II", "III", "IIII", "IV", "V", "VI"
)

# special values
w_spec <- c(
  "NN", "NMN", "NAME",
  "UNK", "UNKNOWN", "AKA", "KNOWN AS", "ALSO KNOWN AS", "ALIAS",
  "BLIND"
)

# test
w_test <- c(
  "TEST", "TST", "DUMMY", "VOTER",  "([A-Z])\\1{2,}"
)
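
To see how such patterns behave once each is wrapped in word boundaries and joined into one alternation, here is a toy demonstration. The values in `words` and `toy` are made up for illustration, and `paste0()` stands in for the glue calls used in the chunks of this section.

```r
# Toy demonstration of whole-word matching; not run against the real data.
words <- c("JR", "MRS", "([A-Z])\\1{2,}")  # last entry: one letter repeated 3+ times
pattern <- paste0("\\b", words, "\\b", collapse = "|")

toy <- c("SMITH JR", "JROME", "MRS H L", "AAA", "III", "ANNE")
stringr::str_extract(toy, pattern)
# [1] "JR"  NA    "MRS" "AAA" "III" NA
```

The `\\b` anchors are what keep "JR" from matching inside "JROME", and the back-reference `\\1` only fires when the whole word is a single repeated letter, which is why "III" is caught even though it is also listed explicitly as a generation suffix.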

4.4.1 Last name

# regular expression to match words
w_regexp <- 
  c(w_hons, w_gen, w_spec, w_test) %>% # all special words
  unique() %>% # make it a set
  dplyr::setdiff( # remove words that appear to mostly be validly used
    c(
      "BISHOP",
      "BLIND",
      "BROTHER",
      "DOCTOR",
      "ELDER",
      "FIRST",
      "JUNIOR",
      "MASTER",
      "MISS",
      "MISTER",
      "PASTOR",
      "SENIOR",
      "TEST",
      "THIRD",
      "VOTER"
    )
  ) %>% 
  glue::glue(x = . , "\\b{x}\\b") %>%  # must be words
  glue::glue_collapse(sep = "|") # search for any

x <- d %>% 
  dplyr::mutate(
    match = 
      last_name %>% 
      stringr::str_to_upper() %>% 
      stringr::str_replace_all(pattern = "[^ A-Z]", replacement = " ") %>% 
      stringr::str_squish() %>% 
      stringr::str_extract(pattern = w_regexp)
  ) %>% 
  dplyr::filter(!is.na(match))

nrow(x)
[1] 124
x %>% 
  dplyr::arrange(match, sex, last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex match
WILLIAMSON DR IRVIN D NA MALE DR
I’ANSON-JACKSON JENNIFER NA NA FEMALE I
BIREN II WILLIAM GEORGE NA MALE II
CRITTENDON II WILLIAM BURRELL NA MALE II
EVANS II DONALD M NA MALE II
FILLINGHAM, II ROBERT E NA MALE II
GOODWIN II PAUL J NA MALE II
GREEN II BILLY HOWARD NA MALE II
METTS II CAREY MONTGOMERY NA MALE II
MILHORN II JOSEPH JAMES NA MALE II
PERSON II DARYEL JAMES NA MALE II
SEABOLD II GERALD W NA MALE II
STANLEY II WILLIAM A NA MALE II
TAYLOR II ROBERT D NA MALE II
THOMBS II DANIEL EUGENE NA MALE II
WATSON II ROBERT NATHANIEL NA MALE II
WORD II JOE NAHAN NA MALE II
AUSLEY III PRESTON ALEXANDER NA MALE III
BEATTY III CURTIS M NA MALE III
BLACKWELDER III DWIGHT MCNAIRY NA MALE III
BOONE III JAMES HENRY NA MALE III
BOSQUEZ III RICHARD NA NA MALE III
CHAPPELL III TRAVIS NA NA MALE III
COCKERHAM III BOBBY LEE NA MALE III
CONNELL III THOMAS JOSEPH NA MALE III
FAULKNER III HOWARD VERNON NA MALE III
GOODWIN III WARD ALEXANDER NA MALE III
GROUSE III CHARLES J NA MALE III
HARRIS III WILLIAM T NA MALE III
KNOX III JOHN J NA MALE III
LANE III WILLIAM JAMES NA MALE III
MCGUIRT III JAMES WILLIAM NA MALE III
MILLER III JOHNNIE H NA MALE III
MOORE III JAMES P NA MALE III
NEWSOME III THOMAS LESLIE NA MALE III
PEACOCK III EDWARD JACKSON NA MALE III
PETERS III MARION HOWELL NA MALE III
PRUDEN III THOMAS EUGENE NA MALE III
REDFEARN III WILBERT NA NA MALE III
SMITH III GUY R NA MALE III
THOMPSON III EMERY NA NA MALE III
BAKER IIII WILLAIM RAINEY NA MALE IIII
BUXTON IV SAMUEL R NA MALE IV
LONG IV FLOYD M NA MALE IV
THOMPSON IV HARRY M NA MALE IV
ANSELMENT JR JOSEPH LEONARD NA MALE JR
BALL JR SAMUEL LEE NA MALE JR
BARKLEY JR CHARLES W NA MALE JR
BENDER JR JOHN JOHN P NA MALE JR
BINGHAM JR. AMES EDMOND NA MALE JR
BIRCHFIELD JR MILBURN JOEL NA MALE JR
BLEDSOE JR HOMER BLAINE NA MALE JR
BROWN JR ROBERT A NA MALE JR
BUNDESMAN JR BERNARD B NA MALE JR
BYRD JR HERBERT L NA MALE JR
CAIL JR MALCOLM LEHOLMES NA MALE JR
CARRIER JR ROBERT WILSON NA MALE JR
CHAMBERS JR KENNETH RAY NA MALE JR
CHARLES JR WILLIE J NA MALE JR
CLAY JR WILEY WALTON JR MALE JR
CLAYTON JR JAMES D NA MALE JR
CULBRETH JR WALTER E NA MALE JR
DAYE JR. JAMES NA JR MALE JR
ENGLISH JR WARREN ROBERT NA MALE JR
EVANS JR RALPH NA II MALE JR
FAILLE JR EDWARD J NA MALE JR
FARMER JR BENJAMIN STEVE NA MALE JR
FRAZIER JR JAMES A NA MALE JR
GARCIA JR FRANK NA NA MALE JR
HALL JR JAMES B NA MALE JR
HARDIN JR CHARLES ELMORE NA MALE JR
HARGRAVES JR JAMES CALVIN NA MALE JR
HARRIS JR CHAMP NA NA MALE JR
HAWKINS JR REED GREGORY NA MALE JR
HENSLEY JR LAWRENCE G NA MALE JR
HERNDON JR EVERETT GEORGE NA MALE JR
HILL JR JAMES C NA MALE JR
HOYLE JR GEORGE A NA MALE JR
HUMPHRIES JR DONNIE R NA MALE JR
KENNEDY JR THOMAS E NA MALE JR
KUBU JR JERRY JOHN NA MALE JR
LANE JR DAVID C NA MALE JR
LAWRENCE JR HARRY NA NA MALE JR
MARBLE JR ROBERT STERLING NA MALE JR
MCCLURE JR DONALD R NA MALE JR
MCGUIRE JR JOHN M NA MALE JR
MONGIOVI JR ANTHONY B NA MALE JR
MOORE JR HARRY GRADY NA MALE JR
MORRISON JR WILLIAM EMERSON NA MALE JR
MOSES JR MICHAEL WILLIAM NA MALE JR
NASIFE JR SAMUEL NICHOLAS NA MALE JR
OUTLAND JR HOWARD BROWN NA MALE JR
OVERTON JR ROBERT ALLEN NA MALE JR
PARKS JR JOEL TIMOTHY NA MALE JR
PULSIFER JR HAROLD WINFRED NA MALE JR
REED JR BRUCE HAL NA MALE JR
ROBERTS JR GEORGE MARION NA MALE JR
RUSSELL, JR. KERMITT PATRICK NA MALE JR
SHADE JR EVERETTE LEE NA MALE JR
SHEALLY JR WILLIAM B NA MALE JR
ST JEAN JR JOSEPH NA NA MALE JR
STANSBERRY JR DAVID R NA MALE JR
STREETER JR THOMAS EARL NA MALE JR
VAN DOREN JR EDWARD FOSTER NA MALE JR
WHITEHOUSE JR JOHN JOSEPH NA MALE JR
WHITFIELD JR RAYMOND E NA MALE JR
WIEGOLD JR RICHARD MARTIN NA MALE JR
WILLIAMSON JR SOLOMAN J NA MALE JR
YOAKUM JR JC NA NA MALE JR
SMITH MD PATRICIA ANN NA FEMALE MD
VAN NAME MARY A NA FEMALE NAME
VAN NAME NANCY HIGGINS NA FEMALE NAME
VAN NAME CHRISTOPHER PAUL NA MALE NAME
VAN NAME GARY GEORGE NA MALE NAME
VAN NAME MARK L NA MALE NAME
BRAKE SR ESS CAROLYN G NA FEMALE SR
DOSS SR MICHAEL RAY NA MALE SR
HICKS SR WILFORD LYTLE SR. MALE SR
STIMSON SR RICHARD BARRETT NA MALE SR
VAUGHN SR WALTER S NA MALE SR
WHITWORTH SR RANDY SEAN NA MALE SR
V’SOSKE ERIKA DONNELL NA FEMALE V
MOODY V WILLIE HOLMES NA MALE V
TENNENT V EDWARD S NA MALE V

I eyeballed the results and removed words which appeared to be mostly validly used.

Invalid words:

  • As whole field: (none)
  • As first word: (none)
  • As last word: DR, II, III, IIII, IV, JR, MD, SR
  • As internal word: SR
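
Since almost all of the invalid words sit at the end of the last name, they could be split off into their own column. The following is a minimal sketch of that idea; `split_last_name()` is a hypothetical helper and the original analysis does not perform this step. It only handles the "last word" case, so an internal word such as the SR in "BRAKE SR ESS" is left in place.

```r
# Hypothetical helper: separate a trailing title or generation suffix from a
# last name. The word list is the "As last word" list above.
split_last_name <- function(x) {
  sfx_regexp <- "[ ,]+(DR|II|III|IIII|IV|JR|MD|SR)\\.?$"
  data.frame(
    last_name = stringr::str_remove(x, sfx_regexp),
    suffix    = stringr::str_match(x, sfx_regexp)[, 2],  # column 2 = captured word
    stringsAsFactors = FALSE
  )
}

split_last_name(c("FILLINGHAM, II", "RUSSELL, JR.", "VAN DOREN JR", "VAN NAME", "BRAKE SR ESS"))
```

Names with no trailing match, such as "VAN NAME", come back unchanged with a missing `suffix`.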

4.4.2 First name

# regular expression to match words
w_regexp <- 
  c(w_hons, w_gen, w_spec, w_test) %>% # all special words
  unique() %>% # make it a set
  dplyr::setdiff( # remove words that appear to mostly be validly used
    c(
      "BISHOP",
      "BROTHER",
      "DOCTOR",
      "ELDER",
      "JUNIOR",
      "MASTER",
      "MISTER",
      "PASTOR",
      "PROFESSOR"
    )
  ) %>% 
  glue::glue(x = . , "\\b{x}\\b") %>%  # must be words
  glue::glue_collapse(sep = "|") # search for any

x <- d %>% 
  dplyr::mutate(
    match = 
      first_name %>% 
      stringr::str_to_upper() %>% 
      stringr::str_replace_all(pattern = "[^ A-Z]", replacement = " ") %>% 
      stringr::str_squish() %>% 
      stringr::str_extract(pattern = w_regexp)
  ) %>% 
  dplyr::filter(!is.na(match))

nrow(x)
[1] 328
x %>% 
  dplyr::arrange(match, sex, last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex match
AAL-ANUBIAIMHOTE DR NGOZI NA NA FEMALE DR
WILMES FATHER JAMES NA MALE FATHER
ARMOUR I ELISABETH NA NA FEMALE I
BOYKIN-PRIDE I’MAN BRIANN NA FEMALE I
BRADLEY I-ASIA VICTORIA-CHERIS NA FEMALE I
BRITTON I-CHI GUO NA FEMALE I
BROOME I SONYA TIERRA NA FEMALE I
BULLARD NANCY I W NA FEMALE I
CARLYLE I E NA FEMALE I
CARTER I JEANNETTE GUILBE NA FEMALE I
CHANG I-WEN NA NA FEMALE I
COLEMAN JANA’I D NA FEMALE I
CSEH MON I WANG NA FEMALE I
DESROSIERS I DARLENE NA FEMALE I
DOSHI I NA NA FEMALE I
ERVIN I MEI CHOU NA FEMALE I
GAYE I COLEEN M NA FEMALE I
GLASPIE I CHARLOTTE NA FEMALE I
GREEN I A NA FEMALE I
HALL I’MESHA L NA FEMALE I
HEYWARD I CARTER NA FEMALE I
HU EDNA I JEN NA FEMALE I
HUNEYCUTT I SUZANNE EUDY NA FEMALE I
JAN I-RAN HO NA FEMALE I
KUEHR-MCLAREN I WENDY NA FEMALE I
LANE I E MRS NA FEMALE I
LEWIS LEISA I OITERONG NA FEMALE I
MARTIN I MARY NA FEMALE I
MENG CHENG-I C NA FEMALE I
MOORING I’RISHA ORCHA’ NA FEMALE I
MORRIS I LANE NA NA FEMALE I
MULLIS LISA I BELL NA FEMALE I
NEAL I M NA FEMALE I
PERRY I SUN NA FEMALE I
POPE I-ASIA COX NA FEMALE I
ROSS I G MRS NA FEMALE I
SAUNDERS VICKI I SUTTON NA FEMALE I
SHERWOOD I-LI BETH NA FEMALE I
SIMMONS I-EESHA D NA FEMALE I
SIU I-MEI NA NA FEMALE I
SOUTHERLAND I KATHLEEN NA FEMALE I
SUMMEY I V NA FEMALE I
SUTARIA DEBORAH I S NA FEMALE I
TAI CHIH-I NA NA FEMALE I
TUTTLE I BESSIE NA FEMALE I
WASHINGTON CAROLINE I HALEY NA FEMALE I
WEIR I SUN NA FEMALE I
WILEY I’AIESHA SHANTEA NA FEMALE I
WOOD I F JR FEMALE I
ARNOLD I B NA MALE I
BATES I C NA MALE I
BREWER I V NA MALE I
CALDWELL I M NA MALE I
CHEO I-DEH NA NA MALE I
CLARKE I MITCHELL NA MALE I
COLLEY I D NA MALE I
DOWNS I V NA NA MALE I
EDWARDS I J JR MALE I
FERGUSON I M JR MALE I
FU I-KONG BATOR NA MALE I
GORDON I BRYCE NA MALE I
GUNTER I W NA MALE I
HAIG I REID S NA MALE I
HICKS I FAISON NA NA MALE I
HINES I ALAN NA MALE I
HOOD I G NA MALE I
HOWARD I CLARENCE NA MALE I
JENKINS I D III MALE I
JOHNSON I M NA MALE I
JOHNSTON I C NA MALE I
KELLY I PERRY NA MALE I
KINLAW I W NA MALE I
LAKE I BEVERLY JR MALE I
LITTLE I MAYO JR MALE I
LONGMUIR I S NA MALE I
LYONS I CHARLES NA MALE I
MANESS I M NA MALE I
MCNEIL I J NA MALE I
MILLER I J NA MALE I
MILLER I D MCGILVRAY NA MALE I
PALMER I JEREMIAH NA MALE I
PATTERSON I EUGENE NA MALE I
PAUL I B NA MALE I
PLYLER I F JR MALE I
POPE I H JR MALE I
POWELL I HILL NA MALE I
POWELL STEVEN I VANROOY NA MALE I
QUINN I J NA MALE I
QUINN I J JR MALE I
RUSS I V NA MALE I
SMITH I BRUCE NA MALE I
SMITH I MELVIN NA MALE I
SOLOMON I S NA MALE I
STONE I L NA MALE I
TERRY I B III MALE I
TRAVIS I A NA MALE I
WAKEFIELD I NELSON NA MALE I
WALLACE I J NA MALE I
WARREN I NA NA MALE I
WU I-CHAN JOHN NA MALE I
MANUEL WALTER III NA NA MALE III
MCPHERSON VAN III NA NA MALE III
NASH SAMUEL III NA NA MALE III
PATALANO LOUIS III NA NA MALE III
SCOTT CALVIN III NA NA MALE III
SILVER III HAYDEN NA MALE III
COPELAND IV EDWARD JAMES NA MALE IV
ANDERSON ELBERT JR NA NA MALE JR
BARNEY LEO JR NA NA MALE JR
BOWLES ROBERT JR NA NA MALE JR
BRYANT FREDDIE JR NA NA MALE JR
COLLINS JACK JR NA NA MALE JR
DARRELL JAMES JR NA NA MALE JR
DAVIS HENRY JR NA NA MALE JR
GERTZ JR RICHARD NA MALE JR
HOAGLAND JR SANDY NA MALE JR
HOLLEY JR JOHN MARSHAL NA MALE JR
JONES JR MICHAEL NA MALE JR
JOYNER JR EARNEST NA MALE JR
MCADAMS WILL,JR NA NA MALE JR
MCCLELLAND ERNEST JR NA NA MALE JR
MCCOY JR RICHARD TUNN NA MALE JR
MCIVER SIM JR NA NA MALE JR
MCLEOD WILLIE JR NA NA MALE JR
MULL MADISON JR NA NA MALE JR
PALMS DONALD JR NA NA MALE JR
PEOPLES LONZO JR NA NA MALE JR
ROSADO ALEJANDRO JR NA NA MALE JR
THOMPSON JOSEPHUS JR NA NA MALE JR
TILLMAN BENNIE JR NA NA MALE JR
TOOLE JR NA NA MALE JR
WOODS HOUSTON JR NA NA MALE JR
STOCKELL MD COOPER III MALE MD
SPEIGHT MISS STEPHANI RENEE’ NA FEMALE MISS
FATE MR NA NA MALE MR
KANE MR NA NA MALE MR
BECK MRS WILLIAM E NA FEMALE MRS
BINGMAN GRAY MRS NA NA FEMALE MRS
BURKE MRS GEORGE W NA FEMALE MRS
CARTER PAUL MRS NA JR FEMALE MRS
CHATMAN MRS H L NA FEMALE MRS
COVINGTON EDNA(MRS PERRY, JR) NA FEMALE MRS
CROMER BETTY MRS A NA FEMALE MRS
DAVENPORT MRS H T NA FEMALE MRS
DODSON RAY MRS NA NA FEMALE MRS
EATON MRS JOHN C NA FEMALE MRS
ESTES ALMA MRS A NA FEMALE MRS
FIELDS MRS G CLINTON NA FEMALE MRS
FIELDS MRS JAMES C NA FEMALE MRS
FULP JAMES MRS C NA FEMALE MRS
GIBSON H MRS L NA FEMALE MRS
GOOLSBY EUGENE MRS NA NA FEMALE MRS
GURGANIOUS JOHN MRS HALLIE NA FEMALE MRS
HAMRICK JOHN R MRS MARGARET NA FEMALE MRS
HARRIS MRS FRED W NA FEMALE MRS
HARRIS MRS P D NA FEMALE MRS
HARRIS MRS WILLIAM W NA FEMALE MRS
HARTIS FRANK E MRS THAMES NA FEMALE MRS
HOLLIDAY MRS JOSEPH NA NA FEMALE MRS
JEFFERSON MRS ATHOL G NA FEMALE MRS
JOHNSON MRS CLYDE W NA FEMALE MRS
LAMB WILSON MRS C NA FEMALE MRS
LARIMORE WILLIAM MRS NA NA FEMALE MRS
LUU MRS NA NA FEMALE MRS
MABE STEVE MRS NA NA FEMALE MRS
MARTIN JAMES MRS H NA FEMALE MRS
MASSAGEE JAMES H MRS SUE NA FEMALE MRS
MOODY MRS WILLARD W NA FEMALE MRS
MORGAN MRS ROY A NA FEMALE MRS
NICHOLS DORIS ( MRS W NA NA FEMALE MRS
POPE MRS O N JR FEMALE MRS
REICH MRS LESTER G NA FEMALE MRS
RHONEY ROBERT MRS T NA FEMALE MRS
RIVES MRS WILBUR A NA FEMALE MRS
SCALES BETTY MRS H NA FEMALE MRS
SMITH MRS WILLIAM JOE DAVIS NA FEMALE MRS
TIMMONS THOMAS MRS E NA FEMALE MRS
TRULL JAMES MRS T NA FEMALE MRS
WARD MARVIN MRS M NA FEMALE MRS
WHITE JOE MRS MRS NA FEMALE MRS
WOODLEY MRS WALLACE ( RUTH ) NA FEMALE MRS
QUEEN GERALDINE(NMN NA NA FEMALE NMN
BORDERS EUGENE(NMN) NA NA MALE NMN
FOSTER OTIS(NMN) JR NA MALE NMN
FEATHERSTONE REV. ROBERT A NA MALE REV
GILDEA SISTER THERESINE NA FEMALE SISTER
KELLY SISTER ANN NA FEMALE SISTER
PEGUESE SISTER GIRTRUE NA FEMALE SISTER
ROSS SISTER S NA FEMALE SISTER
TANCRAITOR SISTER MAXINE ELIZABETH NA FEMALE SISTER
DUNTON JULIAN SR NA NA MALE SR
GRAHAM STEPHEN SR LEGREE NA MALE SR
PHILLIPS SR DAYLE KELLEY NA MALE SR
ADAMS V JAN NA FEMALE V
ANDERSON V RUTH K NA FEMALE V
BATKIN V MARIA NA FEMALE V
BENFIELD RHONDA V NA NA FEMALE V
BOWDEN V RUTH NA FEMALE V
BOYD V MARIE NA FEMALE V
BRANDT V KATHLEEN GRY NA FEMALE V
CALHOUN V ANNE NA FEMALE V
CARLAND V ANN NA FEMALE V
CARTER PAUL V MRS NA FEMALE V
CAVENDER V DORIS NA FEMALE V
COOK INEZ V CARY NA FEMALE V
DALBERG V ANDREA NA FEMALE V
DOTY V’ONA GILBERT NA FEMALE V
EDWARDS V ERLINE NA FEMALE V
EVANS-SMITH V MARIE HUMPHERY NA FEMALE V
FINLEY V ANNE NA FEMALE V
FUTRELL V JEANINE BOWDEN NA FEMALE V
GIBBS V WILLA NA FEMALE V
GLENN V’SHATAVIA D NA FEMALE V
HALL CATHEDRIA V HOOKER NA FEMALE V
HALL V JUANITA NA FEMALE V
HAMILTON V KAYE NA NA FEMALE V
JAYANTY LAKSHMI S V S NA FEMALE V
JOHNSON V JOLINE NA FEMALE V
KENNEDY V0NCIEAL LEE NA FEMALE V
KRITES V C NA FEMALE V
LANCASTER ALDA V LIMBAUGH NA FEMALE V
LEE V JUANITA NA FEMALE V
LEE V FLORENCE NA FEMALE V
LYONS V BETTIE NA FEMALE V
MARSHALL CALLIE V. LUTZ NA FEMALE V
MAYBERRY V JACQUELINE NA FEMALE V
MOCK V CHARLENE D NA FEMALE V
MOORMAN V E NA FEMALE V
MORTON SANDRA V GOSNELL NA FEMALE V
OSLEY V BONITA NAFZIGER NA FEMALE V
OWENSBY V ANN NA FEMALE V
PAYNE V LUCILLE NA FEMALE V
PERERA V MALLIKA NA FEMALE V
POWELL V ESTELLE NA FEMALE V
RASH V ANDERSON NA FEMALE V
RAY V FRANCIS NA FEMALE V
SEMONCHE LAURA V A NA FEMALE V
SHAFFER V LYNNE STRICKLAND NA FEMALE V
SHELF V S MRS NA FEMALE V
SMELTZER V DIANE NA FEMALE V
SMITH V RAE NA FEMALE V
STANTON V GAYLE NA FEMALE V
STERLING V LEE NA FEMALE V
STODDARD V CHRISTIVE NA FEMALE V
STREIFF CONNIE V R NA FEMALE V
TEAGUE V MICHELLE NA FEMALE V
TERRY CAROLYN V MASK NA FEMALE V
THOMPSON V DELORES NA FEMALE V
TINNEY V LEE W NA FEMALE V
VANNOY V GAIL NA FEMALE V
WAGGONER V C NA FEMALE V
WALKER V FRANCES NA FEMALE V
WHITE V CAROLE NA NA FEMALE V
WILLIAMS JACQUELYNE V. MOORE NA FEMALE V
WRIGHT O V LEDFORD NA FEMALE V
ADAMS A V NA NA MALE V
ADAMS V WAYNE NA MALE V
ALLEN V B NA MALE V
AVVA V SARMA NA MALE V
BARBOUR V KEITH NA MALE V
BAZEMORE V S NA MALE V
BOWMAN V C NA MALE V
BOYKIN V RAYMOND JR MALE V
CLINE V OTHO JR MALE V
CORRELL V C NA MALE V
DEAL R V ROB NA MALE V
DEHART V L JR MALE V
DREYER V DEAN NA MALE V
GORDON V H NA MALE V
HELTON V JOHNNY NA NA MALE V
HICKS V L NA MALE V
HOLLAND V L NA MALE V
HOLLINSHED V E JR MALE V
HONEYCUTT V J NA MALE V
HOUSEHOLDER V R NA MALE V
IDOL V F NA MALE V
IRAGGI V J NA MALE V
IYER V V NA NA MALE V
JACKSON V L NA MALE V
JEFFRIES V’GER S NA MALE V
JONES V W NA MALE V
KRASNIEWICZ V A NA MALE V
KRITES V C NA MALE V
KRYSTOFIAK V L NA MALE V
LEWIS V M NA MALE V
LIND V WILLIAM NA JR MALE V
LOCKAMY V B NA MALE V
LOMBARDI V ALAN NA NA MALE V
MANGIPUDI V RAO NA NA MALE V
MANN R V NA NA MALE V
MARTIN V GRAY JR MALE V
MATHENY V O JR MALE V
MCKINNEY V A NA MALE V
MODLIN V WAYNE NA MALE V
NORMAN V WAYNE NA NA MALE V
OAKLEY V BRADSHER III MALE V
OATES A V NA NA MALE V
OGLESBY V BOYCE JR MALE V
PFAHL V KEVIN NA NA MALE V
PIERANNUNZI V PAUL NA NA MALE V
PLAYER V STEPHEN NA MALE V
POWELL V A JR MALE V
RASH A V NA NA MALE V
REDMOND V PRESTON JR MALE V
REVELS V D NA NA MALE V
REYNOLDS V FRANK NA MALE V
RUMLEY V CLIFTON NA MALE V
SCALDARA A V NA NA MALE V
SHIELDS V E NA MALE V
SLADE V T NA MALE V
TEMPLE V W NA MALE V
WARD V STUART JR MALE V
WHITE A V NA JR MALE V
WHITSON V L NA MALE V
WOOTEN V ALDENE NA MALE V
WYATT V CHARLES NA MALE V
ANTHONY VI JOHNSON NA FEMALE VI
DO VI THUY NA FEMALE VI
GREENE VI HEGE NA FEMALE VI
HUTCHINSON VI THI NA FEMALE VI
LAI VI LE NA FEMALE VI
NGUYEN VI THOAI NA FEMALE VI
NGUYEN VI TUONG NA FEMALE VI
TOWNSEND VI S NA FEMALE VI
VO VI PHUONG NA FEMALE VI
GALLOWAY VI CKY RONALD NA MALE VI
THAI VI KY NA MALE VI
TRAN VI TAN NA MALE VI

I eyeballed the results and removed words which appeared to be mostly validly used.

Invalid words:

  • As whole field: FATHER, III, IV, JR, MD, MR, MRS, SISTER, SR
  • As first word: DR, MISS, MRS, REV, SISTER
  • As last word: III, JR, MRS, NMN, SR
  • As internal word: MRS
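
The four positions used in these lists were assigned by eye. They could also be derived mechanically, as in the sketch below. `word_position()` is a hypothetical helper, not part of the original analysis, and it assumes the field has already been normalised the same way as in the chunk above (upper case, non-letters replaced by spaces).

```r
# Hypothetical helper: report where `word` sits among the space-separated
# tokens of each field. Only the first occurrence of the word is considered.
word_position <- function(field, word) {
  tokens <- strsplit(stringr::str_squish(field), " ", fixed = TRUE)
  vapply(tokens, function(tk) {
    i <- which(tk == word)
    if (length(i) == 0) return(NA_character_)
    if (length(tk) == 1) return("whole field")
    if (i[1] == 1) return("first word")
    if (i[1] == length(tk)) return("last word")
    "internal word"
  }, character(1))
}

word_position(c("MRS", "MRS H L", "RAY MRS", "BETTY MRS A", "ANNE"), "MRS")
# [1] "whole field"   "first word"    "last word"     "internal word" NA
```

Tabulating this result per matched word would give counts for each position rather than a yes/no listing.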

4.4.3 Middle name

# regular expression to match words
w_regexp <- 
  c(w_hons, w_gen, w_spec, w_test) %>% # all special words
  unique() %>% # make it a set
  dplyr::setdiff( # remove words that appear to mostly be validly used
    c(
      "BISHOP",
      "BLIND",
      "BR",
      "BROTHER",
      "DOCTOR",
      "ELDER",
      "FIRST",
      "JR", # invalid & too many to display 
      "JUNIOR",
      "MASTER",
      "MISTER",
      "MRS", # invalid & too many to display
      "NMN", # invalid & too many to display
      "PASTOR",
      "SENIOR",
      "SISTER",
      "I",
      "V",
      "VI",
      "VOTER"
    )
  ) %>% 
  glue::glue(x = . , "\\b{x}\\b") %>%  # must be words
  glue::glue_collapse(sep = "|") # search for any

x <- d %>% 
  dplyr::mutate(
    match = 
      midl_name %>% 
      stringr::str_to_upper() %>% 
      stringr::str_replace_all(pattern = "[^ A-Z]", replacement = " ") %>% 
      stringr::str_squish() %>% 
      stringr::str_extract(pattern = w_regexp)
  ) %>% 
  dplyr::filter(!is.na(match))

nrow(x)
[1] 98
x %>% 
  dplyr::arrange(match, sex, last_name, first_name) %>% 
  knitr::kable()
last_name first_name midl_name name_sufx_cd sex match
WISE DIANA AKA NA FEMALE AKA
CACCAMO KATHLEEN DR NA FEMALE DR
DUNCAN ROSALYN DR NA FEMALE DR
GEORGE AMAY DR NA FEMALE DR
VANN ELLEN DR NA FEMALE DR
ELESHA WILLIAM DR NA MALE DR
ROBICSEK FRANCIS DR NA MALE DR
ROPER THOMAS E DR NA MALE DR
VETTER JOHN S DR NA MALE DR
BIRCHFIELD HARRY LYNN II NA MALE II
DINGMAN LEONARD ALAN II NA MALE II
FRADY ROBERT GLENN II NA MALE II
GLOVER CHARLES WORTH II NA MALE II
HAWKINS ROGER LARRY II NA MALE II
HUNTER ERNEST II NA MALE II
KELLY DAVID LEE II NA MALE II
KERR JAMES II NA MALE II
KUHNE KURT II NA MALE II
ROGERS SYLVESTER II SR MALE II
SHERWOOD GEORGE ROYALL II NA MALE II
SOGLUIZZO JOSEPH JOHN II NA MALE II
VAN GORDER CHARLES OSCAR II NA MALE II
WALSTON CHARLES EDWARD II NA MALE II
WATKINS MONROE II NA MALE II
YOUNGMAN THOMAS ARDEN II NA MALE II
BROWN HARRY III NA MALE III
BROWN MILES III NA MALE III
COOPER DALTON III NA MALE III
DAILEY LANGRA III NA MALE III
FUNDERBURK TRAVIS III NA MALE III
GADISON NATHANIEL III NA MALE III
GAY ROBERT HENRY, III. NA MALE III
GEE LAWRENCE III NA MALE III
HARPER GUS III NA MALE III
HOLT ISAAC III NA MALE III
HUMPHREY ROLAND M III NA MALE III
JOHNSON SHADE III NA MALE III
JOYNER DOUGLAS III NA MALE III
LYNCH ABRAHAM III NA MALE III
MCGILVERY ROBERT III NA MALE III
MCILWAIN FERRY III NA MALE III
PHILLIPS ALEXANDER ROW III NA MALE III
PRICE PAUL III III MALE III
STEELE HARVEY III NA MALE III
TERRY GEORGE III NA MALE III
THOMAS PAUL III NA MALE III
BAKER LOUIS IV NA MALE IV
CROSS EUGENE IV NA MALE IV
ESPOSITO VINCENT JOHN IV NA MALE IV
GUNNOE ROBERT FELIX IV NA MALE IV
HORNEY HARRISON MARTIN IV NA MALE IV
HUMBERT JOHN LAWRENCE IV NA MALE IV
BRONSON JENNIFER MD NA FEMALE MD
MCGIMSEY JAMES F JR MD NA MALE MD
BOLES FAUSTINE MISS NA FEMALE MISS
BREEZE ALMA EARL MISS NA FEMALE MISS
DAVIS JULIA MISS NA FEMALE MISS
GARBER CORNELIA MISS NA FEMALE MISS
HAM MABLE MISS NA FEMALE MISS
MCKOY CAROL MISS NA FEMALE MISS
MORRISON LULA MISS NA FEMALE MISS
MOSER ROSE MISS NA FEMALE MISS
PHILSON CHERYL MISS NA FEMALE MISS
ATKINS DAVID GLEN MR NA MALE MR
LIVENGOOD THURMOND MS NA FEMALE MS
STINTZI MANDI LY NN NA FEMALE NN
CRISSMAN JASON LY NN NA MALE NN
GREENE LESTER D(NN) NA MALE NN
LUKER DANIEL B(NN) NA MALE NN
JOHNSON ROBERT REV NA MALE REV
WORKMAN NATHANIEL REV NA MALE REV
ABBAS MOHAMED SR NA MALE SR
ANSTEAD LENDELL SR NA MALE SR
ANTHONY EVERETT SR NA MALE SR
ARMSTON MILTON SR NA MALE SR
ARRINGTON LEROY SR NA MALE SR
BATTLE NATHANIEL SR NA MALE SR
BERRY RALPH SR NA MALE SR
BROWN NELSON SR NA MALE SR
CARTER FOREST SR NA MALE SR
CLARK JEFFERY SR NA MALE SR
DEGRAFFENRIED EDWARD (NMN)SR NA MALE SR
EUBANKS ALBERT SR NA MALE SR
HARRIS MARION SR NA MALE SR
JOHNSON FRED ALAN SR NA MALE SR
JONES WALTER SR NA MALE SR
LANE LORENZA SR NA MALE SR
LUPTON DENNIS WAYNE SR NA MALE SR
LYNCH LOUIS SR NA MALE SR
MILLER CLARENCE SR NA MALE SR
OSBORNE JOHN SR NA MALE SR
PERAGINE PAUL SR NA MALE SR
SELLARS LARRY SR NA MALE SR
STRICKLAND TIMOTHY SR NA MALE SR
WHITAKER WILLIAM SR NA MALE SR
WHITNEY WILLIAM PRESTON SR NA MALE SR
WIGGINS MINOR SR NA MALE SR
WILLIAMS ERVIN W., SR., NA MALE SR

I eyeballed the results and removed words which appeared to be mostly validly used.

Invalid words:

  • As whole field: AKA, DR, II, III, IV, JR, MD, MISS, MRS, MS, NMN, REV, SR
  • As first word: JR, MRS
  • As last word: DR, II, III, IV, JR, MD, MISS, MR, MRS, NMN, NN, SR
  • As internal word: JR
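
Because NMN simply records that there is no middle name, one option is to convert it to a missing value. This is a sketch under that assumption; `drop_nmn()` is a hypothetical helper and the original analysis stops at identifying the problem. NN is deliberately not handled, since some matches (for example "LY NN") look like a split LYNN rather than a marker.

```r
# Hypothetical helper: remove the NMN marker, with or without parentheses,
# and turn a field that is empty afterwards into a missing value.
drop_nmn <- function(x) {
  x <- stringr::str_remove_all(x, "\\(?\\bNMN\\b\\)?")
  x <- stringr::str_squish(x)
  dplyr::na_if(x, "")
}

drop_nmn(c("(NMN)", "NMN", "(NMN)SR", "LYNN"))
# [1] NA     NA     "SR"   "LYNN"
```

A leftover such as "SR" is still a generation suffix in the wrong field and would need its own repair.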

Timing

Computation time (excl. render): 498.294 sec elapsed

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] hexbin_1.28.2   glue_1.4.2      knitr_1.30      skimr_2.1.2    
 [5] fst_0.9.4       forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
 [9] purrr_0.3.4     readr_1.4.0     tidyr_1.1.2     tibble_3.0.4   
[13] ggplot2_3.3.3   tidyverse_1.3.0 tictoc_1.0      here_1.0.1     
[17] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        lattice_0.20-41   lubridate_1.7.9.2 utf8_1.1.4       
 [5] assertthat_0.2.1  rprojroot_2.0.2   digest_0.6.27     repr_1.1.0       
 [9] R6_2.5.0          cellranger_1.1.0  backports_1.2.1   reprex_0.3.0     
[13] evaluate_0.14     highr_0.8         httr_1.4.2        pillar_1.4.7     
[17] rlang_0.4.10      readxl_1.3.1      rstudioapi_0.13   whisker_0.4      
[21] rmarkdown_2.6     labeling_0.4.2    munsell_0.5.0     broom_0.7.3      
[25] compiler_4.0.3    httpuv_1.5.4      modelr_0.1.8      xfun_0.20        
[29] base64enc_0.1-3   pkgconfig_2.0.3   htmltools_0.5.0   tidyselect_1.1.0 
[33] bookdown_0.21     fansi_0.4.1       crayon_1.3.4      dbplyr_2.0.0     
[37] withr_2.3.0       later_1.1.0.1     grid_4.0.3        jsonlite_1.7.2   
[41] gtable_0.3.0      lifecycle_0.2.0   DBI_1.1.0         git2r_0.28.0     
[45] magrittr_2.0.1    scales_1.1.1      cli_2.2.0         stringi_1.5.3    
[49] farver_2.0.3      renv_0.12.5       fs_1.5.0          promises_1.1.1   
[53] xml2_1.3.2        ellipsis_0.3.1    generics_0.1.0    vctrs_0.3.6      
[57] tools_4.0.3       hms_0.5.3         parallel_4.0.3    yaml_2.2.1       
[61] colorspace_2.0-0  rvest_0.3.6       haven_2.3.1