Last updated: 2020-12-23

Checks: 7 passed, 0 failed

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version c6390cc. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    data/test.txt
    Ignored:    data/test.txt.xz
    Ignored:    output/d.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  data/layout_VR_Snapshot.txt

Unstaged changes:
    Modified:   .gitignore
    Modified:   data/.gitignore

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01_get_check_data.Rmd) and HTML (docs/01_get_check_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author       Date        Message
Rmd   c6390cc  Ross Gayler  2020-12-23  wflow_publish("analysis/*.Rmd")
Rmd   01b669c  Ross Gayler  2020-12-10  Build site.
Rmd   bbb7d9d  Ross Gayler  2020-12-07  End of day
Rmd   babb874  Ross Gayler  2020-12-06  End of day

library(here)
here() starts at /home/ross/RG/projects/academic/entity_resolution/fa_sim_cal_TOP/fa_sim_cal
library(magrittr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stringr)
library(vroom)
library(skimr)
library(knitr)

1 Introduction

Read the data, characterise it to understand it, and check for possible gotchas.

This project uses historical voter registration data from the North Carolina State Board of Elections. This information is made publicly available in accordance with North Carolina state law. The Voter Registration Data page links to a folder of Voter Registration snapshots, which contains the snapshot data files and a metadata file describing the layout of the snapshot data files. At the time of writing the snapshot files cover the years 2005 to 2020 with at least one snapshot per year. The files are ZIP compressed and relatively large, with the smallest being 572 MB after compression.

The snapshots contain many columns that are irrelevant to this project and/or prohibited under Australian privacy law (e.g. political affiliation, race). We initially read all the columns because that may help in debugging the inevitable problems with reading the data. Later, the data set will be restricted to the columns essential for the project.

We use only one snapshot file (VR_Snapshot_20051125.zip) because this project does not investigate linkage of records across time. We chose the oldest snapshot (2005) because it is the smallest and the contents are the most out of date, minimising the current information made available. Note that this project will not generate any information that is not already directly, publicly available from NCSBE.

2 Read data

The snapshot ZIP file was downloaded, uncompressed (5.7 GB), then compressed in XZ format to minimise the size. The compressed snapshot file and the metadata file are stored in the data directory.

raw_file <- here::here("data", "VR_20051125.txt.xz") # raw input file

The cleaned data is stored as an fst format file in the output directory.

d_fst <- here::here("output", "d.fst") # temporary data file
clean_fst <- here::here("output", "clean.fst") # parsed and cleaned data as a dataframe
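A cache file like this is typically used in a read-if-present pattern: parse the raw data only when no cached copy exists. The following is a minimal sketch of that pattern, not the code used in this analysis. The helper `read_cached()` and the toy `build_data()` function are hypothetical, and base R `saveRDS()`/`readRDS()` stand in for the fst functions so that the example runs without extra packages.

```r
# Hypothetical helper: return the cached object if the cache file exists,
# otherwise build it, write it to the cache, and return it.
# saveRDS()/readRDS() are stand-ins for fst::write_fst()/fst::read_fst().
read_cached <- function(cache_path, build) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))
  }
  result <- build()
  saveRDS(result, cache_path)
  result
}

cache_path <- tempfile(fileext = ".rds") # illustrative path only
n_builds <- 0

# Toy stand-in for the expensive parsing step; counts how often it runs.
build_data <- function() {
  n_builds <<- n_builds + 1
  data.frame(id = 1:3, name = c("A", "B", "C"), stringsAsFactors = FALSE)
}

first <- read_cached(cache_path, build_data)  # builds, then writes the cache
second <- read_cached(cache_path, build_data) # reads the cache, no rebuild
```

With this arrangement, deleting the cache file is enough to force the raw data to be parsed again.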

The data is tab-separated, not fixed-width as you might reasonably think from reading the metadata. The field widths (interpreted as maximum lengths) in the metadata are not accurate. Some fields contain values longer than the stated width.

Inspection of the raw data shows that the character fields are unquoted. However, at least one character value contains a double-quote character, which has the potential to confuse the parsing if it is looking for quoted values.
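To illustrate the concern on a small scale, here is a sketch that reads a toy tab-separated input containing a lone double-quote character with quoting disabled. The toy data and column names are made up for illustration, and base R `read.delim()` is used as a stand-in for the parser, so this shows the general idea rather than the behaviour of any particular package.

```r
# Toy tab-separated input; the second record contains a lone double quote.
txt <- "id\tlast_name\n1\tSMITH\n2\tO\"NEIL\n3\tJONES\n"

toy <- read.delim(
  text = txt,
  sep = "\t",
  quote = "",          # treat quote characters as ordinary data
  comment.char = "",   # don't allow for comments
  stringsAsFactors = FALSE
)

nrow(toy)        # number of records read
toy$last_name[2] # the value containing the double-quote character
```

With quoting disabled, the double quote is kept as part of the field value and the record boundaries are determined by line breaks alone.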

d <- vroom::vroom( # read raw data; let vroom guess the field types
  raw_file,
  delim = "\t", # assume that fields are *only* delimited by tabs
  col_names = TRUE, # use the column names on the first line of data
  na = "", # missing fields are empty string or whitespace only (see trim_ws argument)
  quote = "", # don't allow for quoted strings
  comment = "", # don't allow for comments
  trim_ws = TRUE, # trim leading and trailing whitespace
  escape_double = FALSE, # assume no escaped quotes
  escape_backslash = FALSE # assume no escaped backslashes
  )
fst::write_fst(d, d_fst, compress = 100) # save data frame (cheap-skate caching)
d <- fst::read_fst(d_fst) %>% tibble::as_tibble() # get cached data
dim(d)
[1] 8003293      90
  • Correct number of data rows extracted: the external line count of the input file is 8,003,294, which is one header line plus 8,003,293 data rows.
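The row-count check amounts to comparing the number of parsed rows with the number of lines in the file minus the header line. A minimal sketch of that comparison on a toy file follows. The file and its contents are made up, and `readLines()` is only practical for small files; for the real 5.7 GB file the line count was obtained externally.

```r
# Write a small tab-separated file: one header line plus three data lines.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname", "1\tA", "2\tB", "3\tC"), tmp)

n_lines <- length(readLines(tmp)) # total lines, including the header
toy <- read.delim(tmp, quote = "", stringsAsFactors = FALSE)

# The parsed row count should equal the line count minus the header line.
rows_match <- nrow(toy) == n_lines - 1
```

A mismatch here would suggest that some line breaks or delimiters were misinterpreted during parsing.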

3 Characterise data (all records)

Take a very quick look at everything then concentrate on the columns that have a chance of being useful.

glimpse(d)
Rows: 8,003,293
Columns: 90
$ snapshot_dt              <dttm> 2005-11-25, 2005-11-25, 2005-11-25, 2005-11…
$ county_id                <dbl> 18, 7, 10, 16, 58, 60, 62, 73, 74, 87, 99, 3…
$ county_desc              <chr> "CATAWBA", "BEAUFORT", "BRUNSWICK", "CARTERE…
$ voter_reg_num            <chr> "0", "000000000000", "000000000000", "000000…
$ ncid                     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ status_cd                <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ voter_status_desc        <chr> "REMOVED", "REMOVED", "REMOVED", "REMOVED", …
$ reason_cd                <chr> "RL", "R2", "R2", "RP", "R2", "RL", "RP", "R…
$ voter_status_reason_desc <chr> "MOVED FROM COUNTY", "DUPLICATE", "DUPLICATE…
$ absent_ind               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ name_prefx_cd            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ last_name                <chr> "AARON", "THOMPSON", "WILSON", "LANGSTON", "…
$ first_name               <chr> "CHARLES", "JESSICA", "WILLIAM", "VON", "LIZ…
$ midl_name                <chr> "F", "RUTH", "B", NA, "IRENE", "R", "HUGHES"…
$ name_sufx_cd             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ house_num                <dbl> 0, 961, 0, 264, 1536, 1431, 171, 0, 0, 1000,…
$ half_code                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ street_dir               <chr> NA, NA, NA, NA, NA, "E", NA, NA, NA, NA, NA,…
$ street_name              <chr> "ROUTE 4", "TAYLOR", "MIRROR LAKE", "CARL GA…
$ street_type_cd           <chr> NA, "RD", NA, "RD", "RD", "ST", NA, NA, NA, …
$ street_sufx_cd           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ unit_designator          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ unit_num                 <chr> "147 BA", NA, NA, NA, NA, "1", NA, NA, NA, N…
$ res_city_desc            <chr> "CONOVER", "CHOCOWINITY", "BOILING SPRING LA…
$ state_cd                 <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", NA…
$ zip_code                 <dbl> 28613, 27817, 28461, 28570, 27892, 28204, 27…
$ mail_addr1               <chr> NA, "619A FOUNDERS HALL, CP0 # 9100", NA, NA…
$ mail_addr2               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ mail_addr3               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ mail_addr4               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ mail_city                <chr> NA, "ASHEVILLE", NA, NA, NA, NA, "CANDOR", N…
$ mail_state               <chr> NA, "NC", NA, NA, NA, NA, "NC", NA, NA, NA, …
$ mail_zipcode             <dbl> NA, 0, NA, NA, NA, NA, 27229, NA, NA, NA, NA…
$ area_cd                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ phone_num                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ race_code                <chr> "W", "W", "U", "B", "W", "W", "W", "U", "U",…
$ race_desc                <chr> "WHITE", "WHITE", "UNDESIGNATED", "BLACK or …
$ ethnic_code              <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "N…
$ ethnic_desc              <chr> "NOT HISPANIC or NOT LATINO", "NOT HISPANIC …
$ party_cd                 <chr> "REP", "REP", "UNA", "DEM", "REP", "UNA", "D…
$ party_desc               <chr> "REPUBLICAN", "REPUBLICAN", "UNAFFILIATED", …
$ sex_code                 <chr> "M", "F", "U", "M", "F", "F", "M", "U", "U",…
$ sex                      <chr> "MALE", "FEMALE", "UNK", "MALE", "FEMALE", "…
$ age                      <dbl> 62, 26, 0, 58, 63, 30, 93, 0, 0, 82, 57, 72,…
$ birth_place              <chr> NA, "NC", NA, "MI", NA, "VA", "NC", NA, NA, …
$ registr_dt               <dttm> 1984-10-06, 2000-07-31, 1900-01-01, 1978-04…
$ precinct_abbrv           <chr> NA, "CHOCO", NA, NA, NA, NA, NA, NA, NA, "BC…
$ precinct_desc            <chr> NA, "CHOCOWINITY", NA, NA, NA, NA, NA, NA, N…
$ municipality_abbrv       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "JNV…
$ municipality_desc        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "JON…
$ ward_abbrv               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ward_desc                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ cong_dist_abbrv          <chr> NA, "01", NA, NA, NA, NA, NA, NA, NA, "11", …
$ cong_dist_desc           <chr> NA, "1ST CONGRESS", NA, NA, NA, NA, NA, NA, …
$ super_court_abbrv        <chr> NA, "02", NA, NA, NA, NA, NA, NA, NA, "30A",…
$ super_court_desc         <chr> NA, "2ND SUPERIOR COURT", NA, NA, NA, NA, NA…
$ judic_dist_abbrv         <chr> NA, "02", NA, NA, NA, NA, NA, NA, NA, "30", …
$ judic_dist_desc          <chr> NA, "2ND JUDICIAL", NA, NA, NA, NA, NA, NA, …
$ NC_senate_abbrv          <chr> NA, "01", NA, NA, NA, NA, NA, NA, NA, "50", …
$ NC_senate_desc           <chr> NA, "1ST SENATE", NA, NA, NA, NA, NA, NA, NA…
$ NC_house_abbrv           <chr> NA, "006", NA, NA, NA, NA, NA, NA, NA, "119"…
$ NC_house_desc            <chr> NA, "6TH HOUSE", NA, NA, NA, NA, NA, NA, NA,…
$ county_commiss_abbrv     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ county_commiss_desc      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ township_abbrv           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ township_desc            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ school_dist_abbrv        <chr> NA, "SD2", NA, NA, NA, NA, NA, NA, NA, NA, N…
$ school_dist_desc         <chr> NA, "SCHOOL #2", NA, NA, NA, NA, NA, NA, NA,…
$ fire_dist_abbrv          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fire_dist_desc           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ water_dist_abbrv         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ water_dist_desc          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sewer_dist_abbrv         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sewer_dist_desc          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sanit_dist_abbrv         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sanit_dist_desc          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ rescue_dist_abbrv        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ rescue_dist_desc         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ munic_dist_abbrv         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ munic_dist_desc          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dist_1_abbrv             <chr> NA, "02", NA, NA, NA, NA, NA, NA, NA, "30", …
$ dist_1_desc              <chr> NA, "2ND PROSECUTORIAL", NA, NA, NA, NA, NA,…
$ dist_2_abbrv             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dist_2_desc              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ confidential_ind         <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ cancellation_dt          <dttm> NA, 2001-07-06, 2001-02-05, NA, 2001-03-15,…
$ vtd_abbrv                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ vtd_desc                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ load_dt                  <dttm> 2014-07-15 22:21:54, 2014-07-15 22:21:54, 2…
$ age_group                <chr> "41 TO 65", "26 TO 40", "UNKNOWN", "41 TO 65…
skimr::skim(d)
Warning in grepl("^\\s+$", x): input string 3907396 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 3975334 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 388213 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 503879 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 817815 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 7446786 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 7446791 is invalid in this locale
Table 3.1: Data summary
Name d
Number of rows 8003293
Number of columns 90
_______________________
Column type frequency:
character 59
logical 20
numeric 7
POSIXct 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
county_desc 0 1.00 3 12 0 100 0
voter_reg_num 0 1.00 1 12 0 2708878 0
status_cd 2 1.00 1 1 0 5 0
voter_status_desc 2 1.00 6 22 0 5 0
reason_cd 238 1.00 2 2 0 26 0
voter_status_reason_desc 238 1.00 8 56 0 26 0
last_name 122 1.00 1 23 0 269312 0
first_name 254 1.00 1 19 0 176806 0
midl_name 553015 0.93 1 20 0 249768 0
name_sufx_cd 7561920 0.06 1 3 0 222 0
street_dir 7409655 0.07 1 2 0 15 0
street_name 7768 1.00 1 30 0 122064 0
street_type_cd 527462 0.93 1 4 0 215 0
street_sufx_cd 7698925 0.04 1 3 0 15 0
unit_num 7020919 0.12 1 7 0 32785 0
res_city_desc 3750 1.00 3 20 0 856 0
state_cd 7277 1.00 1 2 0 20 0
mail_addr1 6814780 0.15 1 40 0 421307 0
mail_city 6819798 0.15 1 30 0 4168 0
mail_state 6819868 0.15 1 2 0 104 0
phone_num 5370357 0.33 1 7 0 1539509 0
race_code 0 1.00 1 1 0 7 0
race_desc 0 1.00 5 34 0 7 0
ethnic_code 0 1.00 2 2 0 3 0
ethnic_desc 0 1.00 12 26 0 3 0
party_cd 0 1.00 3 3 0 4 0
party_desc 0 1.00 10 13 0 5 0
sex_code 0 1.00 1 1 0 3 0
sex 0 1.00 3 6 0 3 0
birth_place 1716730 0.79 2 2 0 56 0
precinct_abbrv 1865111 0.77 1 6 0 1867 0
precinct_desc 1865111 0.77 2 30 0 2686 0
municipality_abbrv 4396616 0.45 1 4 0 429 0
municipality_desc 4396616 0.45 4 26 0 571 0
ward_abbrv 6116249 0.24 1 4 0 197 0
ward_desc 6116249 0.24 1 28 0 256 0
cong_dist_abbrv 1865114 0.77 2 2 0 13 0
cong_dist_desc 1865114 0.77 2 27 0 46 0
super_court_abbrv 1872590 0.77 2 4 0 68 0
super_court_desc 1872590 0.77 2 30 0 78 0
judic_dist_abbrv 1872576 0.77 2 3 0 40 0
judic_dist_desc 1872576 0.77 2 23 0 54 0
NC_senate_abbrv 1836472 0.77 2 2 0 50 0
NC_senate_desc 1836472 0.77 6 24 0 63 0
NC_house_abbrv 1829345 0.77 3 3 0 120 0
NC_house_desc 1829345 0.77 6 25 0 125 0
county_commiss_abbrv 4365150 0.45 1 4 0 126 0
county_commiss_desc 4365150 0.45 2 30 0 131 0
township_abbrv 6760420 0.16 1 4 0 119 0
township_desc 6760420 0.16 1 27 0 223 0
school_dist_abbrv 3380612 0.58 1 7 0 140 0
school_dist_desc 3380612 0.58 2 30 0 145 0
fire_dist_abbrv 7650404 0.04 1 4 0 82 0
fire_dist_desc 7650404 0.04 5 27 0 107 0
rescue_dist_desc 7885291 0.01 10 16 0 13 0
dist_1_abbrv 1865111 0.77 2 3 0 39 0
dist_1_desc 1865111 0.77 2 27 0 51 0
confidential_ind 0 1.00 1 1 0 2 0
age_group 0 1.00 7 12 0 6 0

Variable type: logical

skim_variable n_missing complete_rate mean count
ncid 8003293 0 NaN :
absent_ind 8003293 0 NaN :
name_prefx_cd 8003293 0 NaN :
half_code 8002085 0 0.38 FAL: 752, TRU: 456
unit_designator 8003293 0 NaN :
mail_addr2 8003292 0 1.00 TRU: 1
mail_addr3 8003293 0 NaN :
mail_addr4 8003293 0 NaN :
water_dist_abbrv 7998651 0 1.00 TRU: 4642
water_dist_desc 8000971 0 1.00 TRU: 2322
sewer_dist_abbrv 8002465 0 1.00 TRU: 828
sewer_dist_desc 8003293 0 NaN :
sanit_dist_abbrv 7997607 0 0.11 FAL: 5069, TRU: 617
sanit_dist_desc 8003293 0 NaN :
munic_dist_abbrv 8002280 0 1.00 TRU: 1013
munic_dist_desc 8002280 0 1.00 TRU: 1013
dist_2_abbrv 8003293 0 NaN :
dist_2_desc 8003293 0 NaN :
vtd_abbrv 8003293 0 NaN :
vtd_desc 8003293 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
county_id 0 1.00 51.96 27.31 1 32 51 74 100 ▅▇▇▆▆
house_num 0 1.00 2664.17 706533.11 0 210 900 3032 1400000000 ▇▁▁▁▁
zip_code 17957 1.00 30806.46 890299.61 0 27523 28027 28401 289309205 ▇▁▁▁▁
mail_zipcode 6819826 0.15 24463505.17 78280243.02 -27379 27812 28345 28699 987725001 ▇▁▁▁▁
area_cd 5621640 0.30 696.09 259.80 -83 336 828 910 999 ▁▃▁▂▇
age 0 1.00 48.71 21.28 0 34 46 60 7644 ▇▁▁▁▁
rescue_dist_abbrv 7885291 0.01 47.54 10.66 12 41 54 55 88 ▁▃▇▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
snapshot_dt 0 1.00 2005-11-25 00:00:00 2005-11-25 00:00:00 2005-11-25 00:00:00 1
registr_dt 0 1.00 1805-08-01 00:00:00 9999-10-21 00:00:00 1995-02-22 00:00:00 75089
cancellation_dt 6240946 0.22 1988-12-06 00:00:00 2005-11-23 00:00:00 2003-01-13 00:00:00 3975
load_dt 0 1.00 2014-07-15 22:21:54 2014-07-15 22:21:54 2014-07-15 22:21:54 1
  • The warning messages from skim() indicate that a handful of rows contain unexpected characters (strings that are invalid in the current locale). If they are in rows we use, they will have to be located and dealt with.
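One way to locate such values is to test every character column for invalidly encoded strings. The sketch below does this on toy data with base R `validUTF8()`. It assumes the text is meant to be UTF-8, which the source data does not state, so the choice of encoding is an assumption; the toy data frame is made up.

```r
# Build a string with a stray byte (0xC9) that is not valid UTF-8.
bad <- rawToChar(as.raw(c(0x4a, 0x4f, 0x53, 0xc9))) # "JOS" + byte 0xC9

toy <- data.frame(
  last_name = c("SMITH", bad, "LEE"),
  first_name = c("ANN", "BOB", "CAT"),
  stringsAsFactors = FALSE
)

# Count invalid strings per character column (assumes UTF-8 is intended).
is_chr <- vapply(toy, is.character, logical(1))
n_invalid <- vapply(toy[is_chr], function(col) sum(!validUTF8(col)), integer(1))

# Row positions of the invalid values in one column.
bad_rows <- which(!validUTF8(toy$last_name))
```

The per-column counts show which fields are affected, and the row positions identify the records to inspect.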

3.1 county_id & county_desc

county_id: County identification number
county_desc: County description

summary(d$county_id)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   32.00   51.00   51.96   74.00  100.00 
table(d$county_id)

     1      2      3      4      5      6      7      8      9     10     11 
142978  32527  10606  21692  23969  19839  41680  17955  30594  83782 196729 
    12     13     14     15     16     17     18     19     20     21     22 
 76685 129792  68065   8558  64968  19521 158203  45262  27093  11937  10507 
    23     24     25     26     27     28     29     30     31     32     33 
 74946  49306  98242 250411  20356  33701 124906  27493  34816 324683  51893 
    34     35     36     37     38     39     40     41     42     43     44 
350882  41914 140041   8765   8810  40697  15672 473739  43500  74559  63376 
    45     46     47     48     49     50     51     52     53     54     55 
101988  20477  29683   5004 103777  39634 111748  10102  40144  50822  62544 
    56     57     58     59     60     61     62     63     64     65     66 
 31436  19721  24684  38463 697897  15768  24296  67268  81129 185852  17200 
    67     68     69     70     71     72     73     74     75     76     77 
106315 227603  15232  40871  40743  10037  30596 179177  23721 107895  36649 
    78     79     80     81     82     83     84     85     86     87     88 
 90736  87196 115178  49978  45743  28494  52563  37016  53848  14744  38191 
    89     90     91     92     93     94     95     96     97     98     99 
  3445 122676  39355 678226  19534  15399  59440  81209  55298  70609  31275 
   100 
 19014 
  • Never missing
  • Integer 1 .. 100
table(d$county_desc)

    ALAMANCE    ALEXANDER    ALLEGHANY        ANSON         ASHE        AVERY 
      142978        32527        10606        21692        23969        19839 
    BEAUFORT       BERTIE       BLADEN    BRUNSWICK     BUNCOMBE        BURKE 
       41680        17955        30594        83782       196729        76685 
    CABARRUS     CALDWELL       CAMDEN     CARTERET      CASWELL      CATAWBA 
      129792        68065         8558        64968        19521       158203 
     CHATHAM     CHEROKEE       CHOWAN         CLAY    CLEVELAND     COLUMBUS 
       45262        27093        11937        10507        74946        49306 
      CRAVEN   CUMBERLAND    CURRITUCK         DARE     DAVIDSON        DAVIE 
       98242       250411        20356        33701       124906        27493 
      DUPLIN       DURHAM    EDGECOMBE      FORSYTH     FRANKLIN       GASTON 
       34816       324683        51893       350882        41914       140041 
       GATES       GRAHAM    GRANVILLE       GREENE     GUILFORD      HALIFAX 
        8765         8810        40697        15672       473739        43500 
     HARNETT      HAYWOOD    HENDERSON     HERTFORD         HOKE         HYDE 
       74559        63376       101988        20477        29683         5004 
     IREDELL      JACKSON     JOHNSTON        JONES          LEE       LENOIR 
      103777        39634       111748        10102        40144        50822 
     LINCOLN        MACON      MADISON       MARTIN     MCDOWELL  MECKLENBURG 
       62544        31436        19721        24684        38463       697897 
    MITCHELL   MONTGOMERY        MOORE         NASH  NEW HANOVER  NORTHAMPTON 
       15768        24296        67268        81129       185852        17200 
      ONSLOW       ORANGE      PAMLICO   PASQUOTANK       PENDER   PERQUIMANS 
      106315       227603        15232        40871        40743        10037 
      PERSON         PITT         POLK     RANDOLPH     RICHMOND      ROBESON 
       30596       179177        23721       107895        36649        90736 
  ROCKINGHAM        ROWAN   RUTHERFORD      SAMPSON     SCOTLAND       STANLY 
       87196       115178        49978        45743        28494        52563 
      STOKES        SURRY        SWAIN TRANSYLVANIA      TYRRELL        UNION 
       37016        53848        14744        38191         3445       122676 
       VANCE         WAKE       WARREN   WASHINGTON      WATAUGA        WAYNE 
       39355       678226        19534        15399        59440        81209 
      WILKES       WILSON       YADKIN       YANCEY 
       55298        70609        31275        19014 
  • Never missing
  • 100 unique values

They look reasonable, to the extent that I can tell without knowing anything about the counties.
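The two tables above show the same counts in the same order, which suggests that each county_id corresponds to exactly one county_desc. That could be checked directly by looking at the distinct pairs. The following is a sketch on made-up toy data, not a result from the real data set.

```r
# Toy data with repeated (county_id, county_desc) pairs.
toy <- data.frame(
  county_id = c(1, 1, 2, 3, 3),
  county_desc = c("ALAMANCE", "ALAMANCE", "ALEXANDER", "ALLEGHANY", "ALLEGHANY"),
  stringsAsFactors = FALSE
)

# Distinct pairs of identifier and description.
pairs <- unique(toy[, c("county_id", "county_desc")])

# One-to-one if no identifier and no description appears in more than one pair.
one_to_one <- !anyDuplicated(pairs$county_id) && !anyDuplicated(pairs$county_desc)
```

Applied to the real data, a one-to-one mapping would yield exactly 100 distinct pairs.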

3.2 voter_reg_num

voter_reg_num: Voter registration number (unique by county)

table(d$voter_reg_num) %>% head(12)

           0 000000000000 000000000001 000000000002 000000000003 000000000004 
           1           10           56           64           65           66 
000000000005 000000000006 000000000007 000000000008 000000000009 000000000010 
          61           65           70           64           75           71 
table(d$voter_reg_num) %>% tail(12)

000999834828 000999834834 000999834837 000999834845 000999834860 000999834869 
           1            1            1            1            1            1 
000999834879 000999834883 000999834884 000999834888 000999834892 000999834900 
           1            1            1            1            1            1 
summary(as.integer(d$voter_reg_num))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
        0     36265    155221   5459965   3039980 999834900 
d$voter_reg_num %>% stringr::str_length() %>% table(useNA = "ifany")
.
      1      12 
      1 8003292 
  • ~2.7M unique values
  • Never missing
  • Integer 0 .. ~1,000M (as strings)
  • Looks like they should be 12-digit integers with leading zeroes
  • Exactly one observation is short

Look at the record with the short value.

d %>% 
  dplyr::filter(stringr::str_length(voter_reg_num) < 12) %>% 
  dplyr::select(county_id, voter_reg_num, status_cd, voter_status_desc, reason_cd, voter_status_reason_desc) %>% 
  knitr::kable()
county_id  voter_reg_num  status_cd  voter_status_desc  reason_cd  voter_status_reason_desc
18         0              R          REMOVED            RL         MOVED FROM COUNTY
  • There is only one short value. It can be ignored because the observation will later be excluded from the data set on the basis of its status (REMOVED, i.e. not active). (I intend to later restrict the data set to only active voters because, to the greatest extent possible, I want to have no duplicate records in the data used for the analyses.)

Check whether county_id x voter_reg_num is unique, as claimed.

d %>% 
  dplyr::select(county_id, voter_reg_num) %>% 
  dplyr::mutate(id = stringr::str_c(as.character(county_id), ".", voter_reg_num)) %>% 
  dplyr::count(id) %>% 
  with(table(n))
n
      1 
8003293 
  • county_id x voter_reg_num is unique, even including observations flagged as duplicates.
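An alternative way to make the same check, without building a concatenated key, is to count duplicated rows over the two columns. This is a sketch on made-up toy data using base R `duplicated()`; it is not the method used above.

```r
# Toy data: registration numbers repeat across counties but not within them.
toy <- data.frame(
  county_id = c(1, 1, 2, 2),
  voter_reg_num = c("000000000001", "000000000002", "000000000001", "000000000002"),
  stringsAsFactors = FALSE
)

# Number of rows that repeat an earlier (county_id, voter_reg_num) pair.
n_dup_pair <- sum(duplicated(toy[, c("county_id", "voter_reg_num")]))

# For contrast: voter_reg_num on its own is not unique in this toy data.
n_dup_reg <- sum(duplicated(toy$voter_reg_num))
```

A count of zero for the pair means the combination is unique, which corresponds to every key having n = 1 in the table above.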

3.3 ncid

ncid: North Carolina identification number (NCID) of voter

  • Always missing

That’s a shame. It would have been useful.

3.4 status_cd & voter_status_desc

status_cd: Status code for voter registration
voter_status_desc: Status code description

table(d$status_cd, useNA = "always")

      A       D       I       R       S    <NA> 
4914521   41348  495603 2546485    5334       2 
table(d$voter_status_desc, useNA = "always")

                ACTIVE                 DENIED               INACTIVE 
               4914521                  41348                 495603 
               REMOVED TEMPORARY REGISTRATION                   <NA> 
               2546485                   5334                      2 
  • 5 unique nonmissing values
  • 2 records with missing values
  • ~4.9M active records

3.5 reason_cd & voter_status_reason_desc

reason_cd: Reason code for voter registration status
voter_status_reason_desc: Reason code description

table(d$reason_cd, useNA = "always")

     A1      A2      AA      AL      AN      AP      AV      DI      DU      IL 
  13737   71296      50  523899    7517  198333 4100220    6991   34357   10585 
     IN      IU      R2      RA      RC      RD      RF      RL      RM      RP 
 181320  303197   78951   59008     662  443486   63501  888056  551073  367511 
     RQ      RS      RT      SM      SO      SP    <NA> 
   4194   89049     729    3975    1307      51     238 
table(d$voter_status_reason_desc, useNA = "always")

                                          ADMINISTRATIVE 
                                                   59008 
                                            ARMED FORCES 
                                                      50 
                               CONFIRMATION NOT RETURNED 
                                                  181320 
                                    CONFIRMATION PENDING 
                                                   71296 
                     CONFIRMATION RETURNED UNDELIVERABLE 
                                                  303197 
                                                DECEASED 
                                                  443486 
                                               DUPLICATE 
                                                   78951 
                                       FELONY CONVICTION 
                                                   63501 
                                     LEGACY - CONVERSION 
                                                   10585 
                                             LEGACY DATA 
                                                  523899 
                                                MILITARY 
                                                    3975 
                                       MOVED FROM COUNTY 
                                                  888056 
                                        MOVED FROM STATE 
                                                   89049 
                                        OVERSEAS CITIZEN 
                                                    1307 
                                   PREVIOUSLY REGISTERED 
                                                      51 
REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS 
                                                  551073 
                      REMOVED DUE TO SUSTAINED CHALLENGE 
                                                     662 
                             REMOVED UNDER OLD PURGE LAW 
                                                  367511 
                                      REQUEST FROM VOTER 
                                                    4194 
                                    TEMPORARY REGISTRANT 
                                                     729 
                       UNAVAILABLE ESSENTIAL INFORMATION 
                                                    6991 
                                              UNVERIFIED 
                                                   13737 
                                          UNVERIFIED NEW 
                                                    7517 
                                    VERIFICATION PENDING 
                                                  198333 
                     VERIFICATION RETURNED UNDELIVERABLE 
                                                   34357 
                                                VERIFIED 
                                                 4100220 
                                                    <NA> 
                                                     238 
  • 26 unique nonmissing values
  • 238 records with missing values
  • ~4.1M verified records

Look at the relationship between status and status reason.

table(
  stringr::str_trunc(d$voter_status_reason_desc, 25), 
  stringr::str_trunc(d$voter_status_desc, 8), 
  useNA = "always"
)
                           
                             ACTIVE  DENIED INACTIVE REMOVED TEMPO...    <NA>
  ADMINISTRATIVE                  0       0        0   59008        0       0
  ARMED FORCES                   50       0        0       0        0       0
  CONFIRMATION NOT RETURNED       0       0   181320       0        0       0
  CONFIRMATION PENDING        71295       0        0       1        0       0
  CONFIRMATION RETURNED ...       0       0   303197       0        0       0
  DECEASED                        0       0        0  443486        0       0
  DUPLICATE                       0       0        0   78951        0       0
  FELONY CONVICTION               0       0        0   63501        0       0
  LEGACY - CONVERSION             1       0    10584       0        0       0
  LEGACY DATA                523897       0        2       0        0       0
  MILITARY                        0       0        0       0     3975       0
  MOVED FROM COUNTY               0       0        0  888055        0       1
  MOVED FROM STATE                0       0        0   89049        0       0
  OVERSEAS CITIZEN                0       0        0       0     1307       0
  PREVIOUSLY REGISTERED           0       0        0       1       50       0
  REMOVED AFTER 2 FED GE...       0       0        0  551072        0       1
  REMOVED DUE TO SUSTAIN...       0       0        0     662        0       0
  REMOVED UNDER OLD PURG...       0       0        0  367511        0       0
  REQUEST FROM VOTER              0       0        0    4194        0       0
  TEMPORARY REGISTRANT            0       0        0     729        0       0
  UNAVAILABLE ESSENTIAL ...       0    6990        0       1        0       0
  UNVERIFIED                  13731       0        0       4        2       0
  UNVERIFIED NEW               7516       0        0       1        0       0
  VERIFICATION PENDING       198331       0        1       1        0       0
  VERIFICATION RETURNED ...       0   34357        0       0        0       0
  VERIFIED                  4099700       1      499      20        0       0
  <NA>                            0       0        0     238        0       0
  • voter_status_desc == "ACTIVE" & voter_status_reason_desc == "VERIFIED"

    • Most likely to be error free (based on common-sense interpretation of the labels)
    • ~4.1M observations
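
The selection described above can be sketched as a simple row filter. This is a minimal base R sketch on a made-up data frame (`d_toy` and `d_verified` are hypothetical names, not objects from the analysis); on the full data it should pick out the 4,099,700 rows in the ACTIVE / VERIFIED cell of the table above.

```r
# Toy stand-in for the voter data `d` (hypothetical rows, not real records)
d_toy <- data.frame(
  voter_status_desc        = c("ACTIVE", "ACTIVE", "REMOVED", "INACTIVE", "REMOVED"),
  voter_status_reason_desc = c("VERIFIED", "LEGACY DATA", "DECEASED", "VERIFIED", NA),
  stringsAsFactors = FALSE
)

# Keep rows that are both ACTIVE and VERIFIED.
# %in% yields FALSE (never NA) for missing values, so rows with a missing
# status reason are dropped instead of producing all-NA rows.
keep <- d_toy$voter_status_desc %in% "ACTIVE" &
  d_toy$voter_status_reason_desc %in% "VERIFIED"
d_verified <- d_toy[keep, , drop = FALSE]

nrow(d_verified)
```

Using `%in%` rather than `==` is a design choice for this sketch: with `==`, the 238 records with a missing status reason would produce NA in the logical index.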

3.6 Name standardisation

Identify any oddities about the name fields that might benefit from standardisation.

I will do this on all the rows, not just the subset to be analysed, for two reasons. First, I expect the oddities to be much the same whether or not the rows are later excluded from the analyses. Second, the larger sample size will help in spotting rare problems.

I will look at the three name fields concurrently because I expect the oddities to be similar across the name fields.

  • last_name: Voter last name
  • first_name: Voter first name
  • midl_name: Voter middle name

Look for possible anomalies in names.

3.6.1 Name missing

d %>% with(table(is.na(last_name)))

  FALSE    TRUE 
8003171     122 
d %>% with(table(is.na(first_name)))

  FALSE    TRUE 
8003039     254 
d %>% with(table(is.na(midl_name)))

  FALSE    TRUE 
7450278  553015 
  • A very small fraction of last names (122) and first names (254) are missing. We don’t expect them to be missing.
  • About 7% of middle names (553,015) are missing. This is expected as middle names are not mandatory.
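
The three missingness tables can also be summarised in one pass as counts and percentages. The following is a minimal base R sketch on a made-up data frame (`d_toy`, `n_missing` and `pct_missing` are illustrative names only). Applied to the full data, the counts reported above correspond to roughly 0.002% of last names, 0.003% of first names, and 6.9% of middle names.

```r
# Toy stand-in for the name fields of `d` (hypothetical rows)
d_toy <- data.frame(
  last_name  = c("SMITH", NA, "JONES", "BROWN"),
  first_name = c("ANN", "BOB", NA, "CAROL"),
  midl_name  = c(NA, NA, "LEE", "MAY"),
  stringsAsFactors = FALSE
)

name_vars <- c("last_name", "first_name", "midl_name")

# Number of missing values per name field
n_missing <- colSums(is.na(d_toy[name_vars]))

# Same thing as a percentage of all rows
pct_missing <- 100 * n_missing / nrow(d_toy)

rbind(n_missing, pct_missing)
```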

Look at the records missing last or first names to see if there is some explanation for their absence.

# last name missing
d %>% 
  dplyr::filter(is.na(last_name)) %>% 
  dplyr::select(
    first_name, midl_name, name_sufx_cd, 
    sex, age,
    house_num, street_name, 
    voter_status_desc, voter_status_reason_desc
    ) %>% 
  dplyr::arrange(voter_status_desc, voter_status_reason_desc, first_name) %>% 
  knitr::kable()
first_name midl_name name_sufx_cd sex age house_num street_name voter_status_desc voter_status_reason_desc
CHRISTINA GAYLE NA FEMALE 27 6024 ROY LEE WOODS REMOVED ADMINISTRATIVE
STEPHANIE ELISE NA FEMALE 25 4877 COLLEGE ACRES REMOVED ADMINISTRATIVE
WILLIAM TODD NA MALE 41 1396 US HWY 221 REMOVED ADMINISTRATIVE
A J NA FEMALE 94 146 RURAL RTE 1 REMOVED DECEASED
ALBERT FREEMAN NA MALE 82 0 UNKNOWN REMOVED DECEASED
BROUNDA KAY NA FEMALE 58 207 7TH REMOVED DECEASED
CLARENCE EDWARD NA MALE 85 230 WAYCROSS REMOVED DECEASED
COLON WALTER NA MALE 71 642 ANNS REMOVED DECEASED
ELOISE L NA FEMALE 0 0 RURAL RTE 2 REMOVED DECEASED
GENE EDWARD NA MALE 74 105 PERRY FOX REMOVED DECEASED
HELEN KOOPS NA FEMALE 89 43 RANGEVIEW ACRES REMOVED DECEASED
JAMES A NA MALE 75 28 PINE SHORE REMOVED DECEASED
JAMES EARL NA MALE 69 3845 ALLISON REMOVED DECEASED
JOHN ROBERT NA MALE 87 0 FRYEMONT REMOVED DECEASED
MARTHA BOATRIGHT NA FEMALE 77 4006 RICHLANDS REMOVED DECEASED
MELISSA O NA FEMALE 39 4912 DEVIL’S RACETRACK REMOVED DECEASED
VERA M NA FEMALE 76 231 PO BOX REMOVED DECEASED
VOLA B NA FEMALE 98 0 TALLULAH REMOVED DECEASED
CHARLES EMMETT NA MALE 73 340 VANDERBILT REMOVED DUPLICATE
FANNIE N NA FEMALE 77 115 CAROLINA REMOVED DUPLICATE
PATRICIA C NA FEMALE 75 1425 WARRIOR REMOVED DUPLICATE
PAULINE NA NA FEMALE 56 8 REDBUD REMOVED DUPLICATE
ROBERT ERIC NA MALE 40 56 KILGORE REMOVED DUPLICATE
VIRGINIA L NA FEMALE 90 428 RINK DAM REMOVED DUPLICATE
WELDON COX NA MALE 76 109 MCNEILL REMOVED DUPLICATE
DEONTRAYVIA EMANUEL NA MALE 30 1508 CHARLES REMOVED FELONY CONVICTION
JANE ANN NA FEMALE 26 157 MT PILOT MHP REMOVED FELONY CONVICTION
KIM LEE NA MALE 51 114 LYLE KNOB REMOVED FELONY CONVICTION
LEANDER WARREN NA MALE 43 53 MOUNTAIN VIEW REMOVED FELONY CONVICTION
MIKE J NA MALE 51 0 SNIDER REMOVED FELONY CONVICTION
SHIRLEY GRIFFIN NA FEMALE 40 1138 ROCKY RUN REMOVED FELONY CONVICTION
WESLEY WILSON NA MALE 41 195 HIGH POINT REMOVED FELONY CONVICTION
WILLIAM RAY NA MALE 43 412 OAK REMOVED FELONY CONVICTION
AMY DENISE NA FEMALE 34 5439 LILLY FLOWER REMOVED MOVED FROM COUNTY
ANDREA CROUCH NA FEMALE 35 90 FOREST OAKS REMOVED MOVED FROM COUNTY
CAROLYN MOORE NA FEMALE 56 9830 RIDGEVILLE REMOVED MOVED FROM COUNTY
DAVID DEAN NA MALE 38 98 CEDAR REMOVED MOVED FROM COUNTY
FREDDA M NA FEMALE 82 16930 KNOXWOOD REMOVED MOVED FROM COUNTY
JAMES DONALD III MALE 45 49 WASHINGTON REMOVED MOVED FROM COUNTY
JESSIE H NA FEMALE 81 206 JONES REMOVED MOVED FROM COUNTY
JUDITH A NA FEMALE 44 2338 PROVIDENCE CREEK REMOVED MOVED FROM COUNTY
KATHLEEN LOUISE NA FEMALE 23 302 UNIVERSITY REMOVED MOVED FROM COUNTY
KELLY R NA FEMALE 38 2315 TORRINGTON REMOVED MOVED FROM COUNTY
LARRY ANTHONY SR MALE 46 0 TRENT REMOVED MOVED FROM COUNTY
LARRY DALLAS NA MALE 63 1045 HUNTER CREEK REMOVED MOVED FROM COUNTY
MARY MOSELEY NA FEMALE 46 407 BUTLER REMOVED MOVED FROM COUNTY
MATTHEW JAMES NA MALE 25 243 7TH REMOVED MOVED FROM COUNTY
MIRANDA MARIE NA FEMALE 23 908 LOGAN REMOVED MOVED FROM COUNTY
NATALIE BASSHAM NA FEMALE 32 4801 HOWE REMOVED MOVED FROM COUNTY
PATSY D NA FEMALE 50 825 CENTER REMOVED MOVED FROM COUNTY
SHIELA WEST NA FEMALE 57 176 SHARON VALLEY REMOVED MOVED FROM COUNTY
STELLA NORWOOD NA FEMALE 41 137 CAMBRIDGE REMOVED MOVED FROM COUNTY
NA NA NA UNK 0 0 NA REMOVED MOVED FROM COUNTY
HENRY RAY NA MALE 69 794 PAUL PAYNE STORE REMOVED MOVED FROM STATE
JASON M NA MALE 35 11421 FOUNTAINGROVE REMOVED MOVED FROM STATE
L KENT NA MALE 65 5 PALM REMOVED MOVED FROM STATE
LINDA LOU NA FEMALE 58 134 SAM RICHARDSON REMOVED MOVED FROM STATE
ROBERT CARL NA MALE 56 867 GEORGE’S GAP REMOVED MOVED FROM STATE
ROY W NA MALE 0 1329 DEVONSHIRE REMOVED MOVED FROM STATE
NA NA NA UNK 0 0 NA REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
DUOC VAN DO UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
JEREMY SEAN NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
L F III MALE 58 520 CRAVEN REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA 08 UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 UNKNOWN REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
  • All the voters missing last_name are REMOVED. Perhaps it’s a side-effect of the removal process.
# first name missing
d %>% 
  dplyr::filter(is.na(first_name)) %>% 
  dplyr::select(
    last_name, midl_name, name_sufx_cd, 
    sex, age,
    house_num, street_name, 
    voter_status_desc, voter_status_reason_desc
    ) %>% 
  dplyr::arrange(voter_status_desc, voter_status_reason_desc, midl_name) %>% 
  knitr::kable()
last_name midl_name name_sufx_cd sex age house_num street_name voter_status_desc voter_status_reason_desc
TRIANOSKY SUSAN SMITH NA FEMALE 46 2493 US HWY 221 ACTIVE CONFIRMATION PENDING
JOSEY BETTY NA FEMALE 61 1688 WALTERS MILL ACTIVE LEGACY DATA
PARRISH BRENDA NA FEMALE 59 461 SHADY GROVE ACTIVE LEGACY DATA
ROBINSON JACQUELINE P NA FEMALE 39 402 MAIN ACTIVE LEGACY DATA
UNDERWOOD REGINA NA FEMALE 46 130 CASWELL ACTIVE LEGACY DATA
JONES LARRY MALLOR NA JR MALE 98 0 RURAL RTE 5 ACTIVE LEGACY DATA
HOLMAN HOWARD NA MALE 41 402 AVALON ACTIVE UNVERIFIED NEW
YABIN NA NA MALE 53 2715 WESTERWOOD VILLAGE ACTIVE VERIFICATION PENDING
MORRIS ALEXANDER NA MALE 30 8118 WOODWAY OAK ACTIVE VERIFIED
BULLARD ALEXIS NA UNK 19 108 THIRD ACTIVE VERIFIED
ZIMMER CLIFFORD NA MALE 64 33 GREEN SPRINGS ACTIVE VERIFIED
CHESTER JAMES NA UNK 39 116 ASH LANDING ACTIVE VERIFIED
ALEXANDER JASON NA MALE 28 165 ARLEE ACTIVE VERIFIED
PATTERSON JOHN DEXTER III MALE 55 3707 PINECREST ACTIVE VERIFIED
MCKEEL LESTER NA MALE 77 3191 BIG DADDYS ACTIVE VERIFIED
FRISBY M JR MALE 33 145 CABARRUS GRAVES DORM ACTIVE VERIFIED
FUQUA MARY NA FEMALE 59 6757 NC HIGHWAY 62 S ACTIVE VERIFIED
MOLET MICHAEL NA MALE 26 629 PINE FOREST ACTIVE VERIFIED
KAUCHICK PAULINE NA FEMALE 26 5404 BAKERS MILL ACTIVE VERIFIED
FUQUA WILLIAM NA MALE 63 6757 NC HIGHWAY 62 S ACTIVE VERIFIED
WARREN NA JD MALE 68 43 J AND S ACTIVE VERIFIED
FRYE WILLIAM C NA II MALE 50 163 GATES FOREST ACTIVE VERIFIED
BURGESS NA NA FEMALE 29 0 HORACE PERRY ACTIVE VERIFIED
PHOENIX NA NA FEMALE 45 496 MIRACLE MOUNTAIN ACTIVE VERIFIED
JUDITH NA NA FEMALE 50 141 HILLSIDE ACTIVE VERIFIED
MALIK NA NA MALE 33 2517 ASHBY WOODS ACTIVE VERIFIED
ELSASS NA NA MALE 37 18924 RIVER FALLS ACTIVE VERIFIED
MAGENTA NA NA FEMALE 42 7324 WINDYRUSH ACTIVE VERIFIED
GRAYWOLF NA NA MALE 57 2402 OTIS ACTIVE VERIFIED
AMEN NA NA MALE 41 15 BERRYMEADOW ACTIVE VERIFIED
SILVERMOON NA NA FEMALE 40 5529 SUNLIGHT ACTIVE VERIFIED
PELKEY CHARES JR MALE 59 1395 EVA DENIED UNAVAILABLE ESSENTIAL INFORMATION
PITTS DARRYL NA MALE 19 1801 FAYETTEVILLE DENIED VERIFICATION RETURNED UNDELIVERABLE
LE SON NA NA UNK 35 5453 SHALLOWFORD DENIED VERIFICATION RETURNED UNDELIVERABLE
WHITFIELD KAY M NA NA FEMALE 79 819 POOR INACTIVE CONFIRMATION NOT RETURNED
MEDLIN ROBERT E NA FEMALE 0 0 UNKNOWN INACTIVE CONFIRMATION RETURNED UNDELIVERABLE
BRICE NA NA MALE 33 30 OREGON INACTIVE CONFIRMATION RETURNED UNDELIVERABLE
CAPARCO NA JEN FEMALE 33 326 ELM INACTIVE CONFIRMATION RETURNED UNDELIVERABLE
MORRISON NA NA MALE 34 9456 LEXINGTON INACTIVE CONFIRMATION RETURNED UNDELIVERABLE
BALLARD LEIGH NA FEMALE 29 200 BARNES REMOVED ADMINISTRATIVE
COTTEN NA NA FEMALE 83 0 SNOW HILL REMOVED ADMINISTRATIVE
SMITH NA NA UNK 0 0 LAWSON REMOVED ADMINISTRATIVE
0000000072294 NA NA MALE 46 1013 FREDERICK REMOVED ADMINISTRATIVE
ALSBROOKS ELEANOR NA FEMALE 90 98 COUNTRY CLUB REMOVED DECEASED
OXENDINE MITCHEL NA MALE 51 410 RURAL RTE 1 REMOVED DECEASED
WILLIS MOLLIE NA FEMALE 92 227 RURAL RTE 1 REMOVED DECEASED
ELLER RETA KATHLEE NA FEMALE 56 321 MARTIN REMOVED DECEASED
CHITTY RUBEN D FEMALE 98 0 ROUTE 1 REMOVED DECEASED
SELENE NA NA FEMALE 57 103 ORCHARD REMOVED DECEASED
LOWRY NA NA MALE 48 1515 ST ANNA REMOVED DECEASED
DE BRAGANZA NA NA MALE 93 2200 BROOKFIELD REMOVED DECEASED
MIDDLETON C NA FEMALE 84 223 MIDDLETON REMOVED DUPLICATE
BELL JAI-MIL NA FEMALE 23 0 WSSU REMOVED DUPLICATE
HWY LIBERA V MALE 69 0 UNKNOWN REMOVED DUPLICATE
OWENS MICHELLE NA FEMALE 24 524 BAKER REMOVED DUPLICATE
WILTON SUSAN LORRAINE NA FEMALE 50 105 EDGEHILL REMOVED DUPLICATE
ALLRED LINDA H NA NA FEMALE 66 3117 WENTWORTH REMOVED DUPLICATE
AMATO,KATHERINE,M NA NA FEMALE 50 1500 PLYMOUTH REMOVED DUPLICATE
AMIDON,PETER,LEVENT NA NA MALE 33 809 CROSSBOW REMOVED DUPLICATE
BEST,SYDNEY,ALLISON NA NA FEMALE 37 0 NA REMOVED DUPLICATE
BETHEA HAROLD LEE NA NA FEMALE 46 2011 MILL POND REMOVED DUPLICATE
BEVERLY CONSTANCE M NA NA FEMALE 37 316 NA REMOVED DUPLICATE
BOOZER ANNA KRISTEN NA NA FEMALE 36 0 NA REMOVED DUPLICATE
BOYD,ALLEN AUBREY,II NA NA MALE 35 0 NA REMOVED DUPLICATE
BRICE.MICHAEL ARTHUR NA NA MALE 37 3102 LAKEHURST REMOVED DUPLICATE
CARR,WENDELL,H JR NA NA MALE 36 524 KINGSTON REMOVED DUPLICATE
CATHEY,LONNIE,JR NA NA MALE 57 1507 NA REMOVED DUPLICATE
CLARK JOANNE BENNETT NA NA FEMALE 63 910 CHURCH REMOVED DUPLICATE
CUSTER,GEORGE D,JR NA NA MALE 51 7313 GOODWILL CHURCH REMOVED DUPLICATE
DAVID HYDE JR NA NA MALE 44 204 TUCSON REMOVED DUPLICATE
DAVISKMICHAEL EDWARD NA NA MALE 51 4411 CORNELL REMOVED DUPLICATE
DUBUISSON ALLISON B NA NA FEMALE 53 7713 THURSTON REMOVED DUPLICATE
FORRIS FAY ANN NA NA FEMALE 39 3005 CRESTBROOK REMOVED DUPLICATE
FULK,IVEY LEE,JR NA NA MALE 45 0 NA REMOVED DUPLICATE
GRIFFIN JANICE FAYE NA NA FEMALE 42 1603 EMMA REMOVED DUPLICATE
HALL,PONTHEOLA,M NA NA FEMALE 53 2110 YEARDLEYS REMOVED DUPLICATE
HANNER JO ANNE LONG NA NA FEMALE 61 4343 WELLS REMOVED DUPLICATE
HODNETT,DORGIE,JR NA NA MALE 52 223 FAULKNER REMOVED DUPLICATE
HOGSHEAD,THOMAS H,JR NA NA MALE 66 1108 EAST GREENWAY REMOVED DUPLICATE
JENKINS,JAMES W,JR NA NA MALE 36 5600 MELVIN REMOVED DUPLICATE
JONES,JOHNSIE,H NA NA FEMALE 92 3804 CHAMPION REMOVED DUPLICATE
KENNY MAHLON DAY NA NA MALE 84 18 KACIA REMOVED DUPLICATE
KEY,GENE SAMUEL,JR NA NA MALE 44 600 TABERNACLE CHURCH REMOVED DUPLICATE
LACKEY CAROL M NA NA FEMALE 70 5225 NA REMOVED DUPLICATE
LAMBERT DAVID M NA NA MALE 43 7726 OAKCLIFFE REMOVED DUPLICATE
LESANE JACQUELINE NA NA FEMALE 35 0 NA REMOVED DUPLICATE
MAPP,DWIGHT,BENJAMIN NA NA MALE 57 2422 PLEASANT HILL REMOVED DUPLICATE
MAY ROBERT BRYAN NA NA FEMALE 87 1614 TILLERY REMOVED DUPLICATE
MCCARTHY LISA ANNE NA NA FEMALE 44 5406 AGATHA REMOVED DUPLICATE
MICHELMJOSEPH JOHN NA NA MALE 40 1113 HENDERSON REMOVED DUPLICATE
NORTON MYRA WOODELL NA NA FEMALE 63 427 CASSELL REMOVED DUPLICATE
PEDIGO BUFORD T NA NA MALE 96 4315 NA REMOVED DUPLICATE
REDWINE MARK ALAN NA NA MALE 53 6 KEANSBURG REMOVED DUPLICATE
ROUSE,ESTHER, MAE NA NA FEMALE 52 611 NA REMOVED DUPLICATE
RUPOLO SANDRA NA NA FEMALE 36 517 TUCSON REMOVED DUPLICATE
SIMS,RAYMOND LEE,SR NA NA MALE 66 2209 NA REMOVED DUPLICATE
URQUHART PARK VASCO NA NA MALE 53 1108 HENDERSON REMOVED DUPLICATE
VALDEZ DONNA A NA NA FEMALE 43 4221 SCOUT REMOVED DUPLICATE
WALKER,CHARLES,JR NA NA MALE 56 339 NA REMOVED DUPLICATE
WESTMORELAND J C NA NA MALE 83 4621 CAMP BURTON REMOVED DUPLICATE
WHITAKER,JAMES L,JR NA NA MALE 35 2831 NA REMOVED DUPLICATE
WHITE,LEE E,JR NA NA FEMALE 35 0 NA REMOVED DUPLICATE
VAN DORSTEN NA NA FEMALE 105 3021 COUNTRY CLUB REMOVED DUPLICATE
BENSON EUGENE NA MALE 60 1525 MAIN REMOVED FELONY CONVICTION
STURDIVANT NA NA MALE 0 0 NO NAME REMOVED FELONY CONVICTION
STURDIVANT NA NA MALE 0 0 NO NAME REMOVED FELONY CONVICTION
BENTON BINARD NA FEMALE 46 8180 SCOTCH MEADOWS REMOVED MOVED FROM COUNTY
JACOBS HUTTO NA FEMALE 29 1415 KELLY REMOVED MOVED FROM COUNTY
HOLSHOUSER LOUISE NA FEMALE 23 291 FORD CREEK REMOVED MOVED FROM COUNTY
GREEN LYNN NA FEMALE 42 2033 HAMLET CHAPEL REMOVED MOVED FROM COUNTY
JOHNSON MICHELLE NA FEMALE 28 2506 NC 10 REMOVED MOVED FROM COUNTY
BLICK MOORE NA FEMALE 53 2947 8TH ST REMOVED MOVED FROM COUNTY
MORRISON SAIN NA FEMALE 52 4006 10TH AV REMOVED MOVED FROM COUNTY
BURGOYNE STEPHANIE A NA FEMALE 55 689 FILLGATE REMOVED MOVED FROM COUNTY
BARNES VALRIE NA FEMALE 56 410 12TH REMOVED MOVED FROM COUNTY
FEARS VANDERBILT JR MALE 45 5436 RIVER FALLS REMOVED MOVED FROM COUNTY
PINION WAYNE NA MALE 63 4015 NC 268 REMOVED MOVED FROM COUNTY
SKELTON WILLIAM III MALE 40 166 FRANKLIN REMOVED MOVED FROM COUNTY
RAINEY NA NA MALE 0 19 SPRUCE REMOVED MOVED FROM COUNTY
SKIA NA NA FEMALE 45 213 PALMER REMOVED MOVED FROM COUNTY
NA NA NA UNK 0 0 NA REMOVED MOVED FROM COUNTY
MAGENTA NA NA FEMALE 42 50 HOLLY HILL REMOVED MOVED FROM COUNTY
DE NA NA MALE 105 1124 FOURTH REMOVED MOVED FROM COUNTY
DE DEBORAH NA NA FEMALE 105 1124 FOURTH REMOVED MOVED FROM COUNTY
VAN EATON NA NA MALE 58 811 BRUCE REMOVED MOVED FROM COUNTY
TUIT NA NA MALE 21 980 DEHART COMM CENTER REMOVED MOVED FROM COUNTY
MARGO (ONLY NA FEMALE 62 119 HILT REMOVED MOVED FROM STATE
LEWIS BUZBY NA FEMALE 42 8218 HERTFORD REMOVED MOVED FROM STATE
RIVERS-MITCHELL TRINA SAGE NA FEMALE 30 402 KYLE REMOVED MOVED FROM STATE
BURNET UNNI KJOSNES NA FEMALE 72 4800 WELWYN REMOVED MOVED FROM STATE
HOCUTT CLAVON MORRIS NA NA MALE 58 178 ROUNTREE REMOVED MOVED FROM STATE
SEXTON NA NA FEMALE 53 2954 ORCHID REMOVED MOVED FROM STATE
ST JOHN NA NA FEMALE 44 6380 LAMSHIRE REMOVED MOVED FROM STATE
REARDON JOSEPH SR MALE 53 7122 TUCKASEEGEE REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
LOVE K NA MALE 81 426 TRYON REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
MASTON MELISSA CHAN NA FEMALE 34 0 CLARK REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
VON LOTHENHEIGER ROBIN NA FEMALE 43 3511 US 117 ALT REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
NA NA NA UNK 0 0 NA REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
HOLLOMAN NA R FEMALE 100 701 HIGH REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
BOSTIAN NA NA FEMALE 0 0 UNKNOWN REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
JOLLY NA NA FEMALE 98 1001 CRESCENT REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
GRAHAM GARLAND NA SR MALE 74 10 CAROLINA PINES MHP REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
KANTHI NA NA FEMALE 56 2211 NA REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
SCHAN NA NA FEMALE 35 332 LOWER GRASSY BRANCH REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
STEWART-WOODS MARY O NA NA FEMALE 53 218 NA REMOVED REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
BEST CHARLES RAY JR MALE 55 0 ROUTE 3 REMOVED REMOVED UNDER OLD PURGE LAW
KAAS EDWARD FRE MALE 76 112 5TH REMOVED REMOVED UNDER OLD PURGE LAW
DAVENPORT H NA FEMALE 98 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
BEAUDION JOHN NA MALE 50 162 COUTRY CLUB REMOVED REMOVED UNDER OLD PURGE LAW
GRAHAM JOHN NA MALE 85 0 RURAL RTE 65 REMOVED REMOVED UNDER OLD PURGE LAW
GUNTER LEE KLEIN NA FEMALE 43 0 PAST LECKA’S REMOVED REMOVED UNDER OLD PURGE LAW
DANIELS MARION NA MALE 90 126 MILL POND REMOVED REMOVED UNDER OLD PURGE LAW
WOOD NICOLE M NA FEMALE 58 525 1ST REMOVED REMOVED UNDER OLD PURGE LAW
BORIS ROBERT NA MALE 63 9211 HORNIGOLD REMOVED REMOVED UNDER OLD PURGE LAW
MOOREFIELD ROBERT STA MALE 49 1505 VILLAGE REMOVED REMOVED UNDER OLD PURGE LAW
JORDAN TERRA NA FEMALE 32 810 PO BOX REMOVED REMOVED UNDER OLD PURGE LAW
D’AIGNEAU TRACY ANN FEMALE 37 1201 SWORDFISH REMOVED REMOVED UNDER OLD PURGE LAW
NCT IS WRONG. SENT NA NA FEMALE 13 0 SCENIC REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
MV 5/17/95 NA NA MALE 89 0 MYERS CHAPEL REMOVED REMOVED UNDER OLD PURGE LAW
PRINCE ALICE KAY NA NA FEMALE 50 0 BYRDSVILLE MO HO PK REMOVED REMOVED UNDER OLD PURGE LAW
LSEWHERE I NA NA FEMALE 39 72 RURAL RTE 1 REMOVED REMOVED UNDER OLD PURGE LAW
MILES IRENE K NA NA FEMALE 100 400 RURAL RTE 1 REMOVED REMOVED UNDER OLD PURGE LAW
CARROLL NA NA FEMALE 58 20 ELLIOTT REMOVED REMOVED UNDER OLD PURGE LAW
STEPHENS JEFFRYN G NA NA FEMALE 61 112 ESTES REMOVED REMOVED UNDER OLD PURGE LAW
HENDERSON RAY MICH NA NA MALE 53 35 DAVIE REMOVED REMOVED UNDER OLD PURGE LAW
MENENDEZ-ZALACAIN NA NA FEMALE 58 1103 GREENSBORO REMOVED REMOVED UNDER OLD PURGE LAW
LASSITE NA NA MALE 54 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
MILLER JOHN KNOX NA NA MALE 83 604 TINKERBELL REMOVED REMOVED UNDER OLD PURGE LAW
DEL ROSSO FRANCES NA NA FEMALE 81 1407 THE OAKS APTS REMOVED REMOVED UNDER OLD PURGE LAW
MEEKER MICHAEL GAI NA NA MALE 58 0 ORANGE GROVE REMOVED REMOVED UNDER OLD PURGE LAW
PRICE INEZ KEETER NA NA FEMALE 72 0 RURAL RTE 2 REMOVED REMOVED UNDER OLD PURGE LAW
RUTT CHARLES E NA NA MALE 64 0 ORANGE GROVE REMOVED REMOVED UNDER OLD PURGE LAW
SUNDSTROM MARY BRE NA NA FEMALE 55 52 FLINT RIDGE APTS REMOVED REMOVED UNDER OLD PURGE LAW
PENDERGRAPH ADA W NA NA FEMALE 105 316 RURAL RTE 2 REMOVED REMOVED UNDER OLD PURGE LAW
WILKINS TERESA ELL NA NA FEMALE 50 0 COUNTRY SQUIRE MO HO PK REMOVED REMOVED UNDER OLD PURGE LAW
FERRETTIJ THOMAS A NA NA MALE 59 0 SR 1115 REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
BRADSHAW NA NA MALE 49 0 WITTY REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
XXX NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
X NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NEW TEST NA NA UNK 0 15 NO REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
ARRINGTON JULI NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA 08 UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 UNKNOWN REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
N NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
0 NA NA FEMALE 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
NA NA NA UNK 0 0 NA REMOVED REMOVED UNDER OLD PURGE LAW
  • Most are REMOVED, but some are ACTIVE and VERIFIED. That suggests the data entry for these records was done after verification.
  • Some appear to have the first name in the middle name field, e.g. (first, middle, last) = ("", "BRENDA", "PARRISH"), ("", "ALEXIS", "BULLARD")
  • Some appear to have first and middle names appended to the last name, e.g. (first, middle, last) = ("", "", "JONES LARRY MALLOR"), ("", "", "AMATO,KATHERINE,M")
  • Some are missing all the names!
  • Some appear to be test data, e.g. last name = "XXX" or "NEW TEST"

There are very few records missing first name or last name, and most of them have REMOVED status. The simplest approach is to exclude those records.

Exclude records with missing first or last name
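
As a minimal sketch of this exclusion (base R on a made-up data frame; `d_toy` and `d_named` are hypothetical names, and the real step may be implemented differently), a record is kept only when both the first and the last name are present:

```r
# Toy stand-in for `d` (hypothetical rows illustrating the cases seen above)
d_toy <- data.frame(
  last_name  = c("PARRISH", NA, "X", "SMITH"),
  first_name = c(NA, "CHRISTINA", NA, "ANN"),
  midl_name  = c("BRENDA", "GAYLE", NA, NA),
  stringsAsFactors = FALSE
)

# Keep a record only if neither first_name nor last_name is missing.
# A missing middle name is not a reason for exclusion.
has_both <- !is.na(d_toy$last_name) & !is.na(d_toy$first_name)
d_named <- d_toy[has_both, , drop = FALSE]

nrow(d_named)
```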

3.6.2 Check for lower-case letters

d %>% dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "[a-z]"))
# A tibble: 50 x 1
   last_name
   <chr>    
 1 McCLURE  
 2 McCULLLEY
 3 DeNOON   
 4 DeSIMON  
 5 DeSIMON  
 6 DeVANE   
 7 DeVANE   
 8 LeMASTER 
 9 MaCDONELL
10 MaCDONELL
# … with 40 more rows
d %>% dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "[a-z]"))
# A tibble: 24 x 1
   first_name
   <chr>     
 1 JoANN     
 2 LaVERNE   
 3 BettyJEAN 
 4 JoANNE    
 5 LaWANDA   
 6 LaVAN     
 7 JoANN     
 8 LaDORA    
 9 JoANN     
10 SiROBERT  
# … with 14 more rows
d %>% dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "[a-z]"))
# A tibble: 169 x 1
   midl_name
   <chr>    
 1 McBRIDE  
 2 McBRIDE  
 3 McCLENNY 
 4 McLEAN   
 5 LaVERNE  
 6 McCLEASE 
 7 McDAY    
 8 McCOLLUM 
 9 McKINNIE 
10 McLAWHORN
# … with 159 more rows
  • 243 names contain lower-case letters (50 last, 24 first, 169 middle).
  • They occur in last, first, and middle names.
  • They appear to be associated with name particles that could optionally be followed by a space, e.g. De VANE.

Map all letters to upper case
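
A minimal sketch of the upper-case mapping, using base R `toupper()` on a made-up data frame (`d_toy` and `d_upper` are illustrative names; the name values are taken from the output above). It then repeats the lower-case check to confirm nothing is left.

```r
# Toy stand-in for the name fields of `d` (values borrowed from the output above)
d_toy <- data.frame(
  last_name  = c("McCLURE", "DeVANE", "SMITH"),
  first_name = c("JoANN", "LaVERNE", "BettyJEAN"),
  midl_name  = c("McBRIDE", NA, "LEE"),
  stringsAsFactors = FALSE
)

name_vars <- c("last_name", "first_name", "midl_name")

# Map every name field to upper case; missing values stay missing
d_upper <- d_toy
d_upper[name_vars] <- lapply(d_upper[name_vars], toupper)

# Re-run the lower-case check: should be FALSE for every field
any_lower <- sapply(d_upper[name_vars], function(x) any(grepl("[a-z]", x)))
any_lower
```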

3.6.3 Check for digits

d %>% dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "[0-9]"))
# A tibble: 90 x 1
   last_name   
   <chr>       
 1 HOLLERS  111
 2 GALL0WAY    
 3 MV 5/17/95  
 4 01          
 5 YARBOR0     
 6 J0HNSON     
 7 LEAK 111    
 8 BURT0N      
 9 REYN0LDS    
10 4MCMANUS    
# … with 80 more rows
d %>% dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "[0-9]"))
# A tibble: 81 x 1
   first_name
   <chr>     
 1 HERM0N    
 2 BL0SSIE   
 3 J0HN      
 4 J0HNNY    
 5 MAJ0R     
 6 J0NATHAN  
 7 J0SEPH    
 8 L0RI      
 9 LEPOLE0N  
10 J0 ELLEN  
# … with 71 more rows
d %>% dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "[0-9]"))
# A tibble: 299 x 1
   midl_name      
   <chr>          
 1 MIZELLE25248249
 2 0VERTON        
 3 111            
 4 RAY 1.         
 5 0DELL          
 6 OLLIE 111      
 7 ARGUS 4TH      
 8 3RD.           
 9 LYN451         
10 JAMES 111      
# … with 289 more rows
  • Some have a zero substituted for the letter O, e.g. J0HNSON, BURT0N
  • Some are obviously generation suffixes, e.g. ARGUS 4TH, LEAK 111 (should be LEAK III)
  • Some result from poor parsing into fields, e.g. MV 5/17/95, MIZELLE25248249
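
The three kinds of digit problem listed above could be separated roughly with a few regular expressions. The following base R sketch is only an illustration: the patterns and the labels ("zero for O?", "111 for III?", "other") are assumptions made here, not rules used in the analysis, and the example names are taken from the output above.

```r
# Example names taken from the output above, plus one clean name
x <- c("J0HNSON", "BURT0N", "LEAK 111", "MV 5/17/95",
       "MIZELLE25248249", "ARGUS 4TH", "0000000072294", "SMITH")

has_digit <- grepl("[0-9]", x)

# Candidate zero-for-O: the name contains a zero and at least one letter,
# and swapping every 0 for O leaves only letters, spaces, apostrophes, hyphens
zero_for_o <- grepl("0", x, fixed = TRUE) &
  grepl("[A-Z]", x) &
  grepl("^[A-Z' -]+$", chartr("0", "O", x))

# Candidate generation suffix typed as digits: "111" as the final token
suffix_111 <- grepl("(^| )111$", x)

digit_class <- ifelse(!has_digit, "no digit",
               ifelse(zero_for_o, "zero for O?",
               ifelse(suffix_111, "111 for III?", "other")))

data.frame(name = x, digit_class = digit_class)
```

Names falling into "other" (dates, appended numbers, ordinal suffixes such as 4TH) would still need to be looked at individually.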

Look at the digits individually.

3.6.4 Check for zero

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "0"))
dim(x)
[1] 67  1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
 [1] "0"                 "0000000072294"     "01"               
 [4] "0WENS"             "ALEM0N"            "AN0Y0"            
 [7] "BISH0P"            "BOLAD0"            "BURT0N"           
[10] "C0NNOR"            "C0STNER"           "CAPUT0"           
[13] "CAUSIEESTK0-LEE"   "CL0NTZ"            "CLEMM0NS"         
[16] "CONN0R"            "CONR0Y"            "CR0NE"            
[19] "D0LLARS"           "D0WNS"             "DAWY0T"           
[22] "DIVINCENZ0"        "EAT0N"             "ESC0BEDO"         
[25] "FERGUS0N"          "FERNANDEZ-BRAV0"   "GALL0WAY"         
[28] "GOM0"              "GUARDAD0"          "HIGUER0-JAMES"    
[31] "J0HNSON"           "JOHNS0N"           "JORDAN-R0BERTS"   
[34] "KEAT0N"            "KOCH0NEAL"         "KONI0R"           
[37] "L0CKLEAR"          "MCC0Y"             "MCD0UGAL"         
[40] "ND0H"              "OCONN0R"           "P0RTER"           
[43] "P0WERS"            "PEREZ-NAVARR0"     "PULL0"            
[46] "R0CCANOVA"         "R0CCO"             "R0DRIGUEZ"        
[49] "REYN0LDS"          "ROSK0S-SHAMBERGER" "RUSS0"            
[52] "SAMARG0"           "SCAMARD0"          "SIMPS0N"          
[55] "SOLTER0"           "SOOTO0"            "ST0LTZ"           
[58] "TANHEHC0"          "TAYL0R"            "THOMPS0N"         
[61] "WINST0N"           "WIT0SKY"           "WO0DARD"          
[64] "YARBOR0"           "YATSK0"           
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "0"))
dim(x)
[1] 73  1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
 [1] "0"           "ALLIS0N"     "ALONZ0"      "ANDREA-0"    "ANTONI0"    
 [6] "AZAVI0US"    "B0BBY"       "B0NNIE"      "B0YCE"       "BL0SSIE"    
[11] "C0LBY"       "C0RDELIA"    "CAR0LE"      "CAR0LYN"     "CHERYL0N"   
[16] "CHRIST0PHER" "D0LORES"     "D0NNA"       "DELI0"       "DONNA CAR0" 
[21] "DOR0THY"     "GREG0RY"     "HERM0N"      "J0"          "J0 ANN"     
[26] "J0 ELLEN"    "J0AN"        "J0HN"        "J0HNNY"      "J0NATHAN"   
[31] "J0SEPH"      "JONATH0N"    "K0LTON"      "KAR0N"       "L0RI"       
[36] "L0UIZETTA"   "LEPOLE0N"    "M0NICA"      "M0NIKA"      "MAJ0R"      
[41] "MARI0N"      "MARY-J0"     "MICHAEL TR0" "NAT0SHA"     "ORLAND0"    
[46] "OTH0"        "P0LLY"       "PLACID0"     "R0BERT"      "R0Y"        
[51] "REYNALD0"    "RODRIG0"     "S0NTE"       "SHANN0N"     "T0NYA"      
[56] "TIM0THY"     "V0NCIEAL"    "Y0LANDA"    
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "0"))
dim(x)
[1] 130   1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
 [1] "0"               "0  CYRUS"        "0'BRIAN"         "0'CONNOR"       
 [5] "0ATES"           "0DEL"            "0DELL"           "0MAE"           
 [9] "0ROURKE"         "0VERTON"         "10052004"        "103"            
[13] "205"             "2205"            "8017"            "ALEXANDER080572"
[17] "ALPHONS0"        "ANDERSON9104576" "ANN B0YD"        "ANTH0NY"        
[21] "AY0"             "BA0-KUO"         "C1010"           "CO0PER"         
[25] "COL0N"           "CR0XIN"          "D0N"             "D0RIS"          
[29] "D0UGLAS"         "DALE401"         "DEANGEL0"        "DEV0NA"         
[33] "DIO0NE"          "DON0HOO"         "EDWARDS1801"     "ELAINE1000"     
[37] "ELLI0TT"         "EMETRIC0"        "EN0"             "F0REST"         
[41] "FINLEY500 SU"    "FRANT0NIO"       "H0USTON"         "J0"             
[45] "J0 MARINOVIC"    "J0E"             "J0HN"            "J0NES"          
[49] "JONATH0N"        "JOYCE701"        "JUNI0R"          "L0CKAMY"        
[53] "L0UISE"          "LAM0ND"          "LAT0NYA"         "LAV0NE"         
[57] "LE0N"            "LEE3708"         "LORENZ0"         "LOUIS7100"      
[61] "LY0NS"           "LYNN1820"        "M00RE"           "M0NGE"          
[65] "M0NIQUE"         "M0RALES"         "MARIE103062"     "NICH0LE"        
[69] "NICH0LS"         "OCONN0R"         "ORLAND0"         "P0RTER"         
[73] "PESATUR0"        "R0BERT"          "R0CHELLE"        "R0DGERS"        
[77] "R0Y"             "ROBINS0N"        "ROSENBAUM3305"   "RUNY0N"         
[81] "SAMBRAN0"        "SC0TT"           "SCOTT3450"       "SH0RROD"        
[85] "T0DD"            "T0NY"            "TAYL0R"          "TH0MPSON"       
[89] "TOME0"           "V0SS"            "VALENTIN0"       "W00LARD"        
[93] "WAYNE030986"     "WRIGHT2106"      "Y0LONDA"         "Y0UNG"          
  • 270 names with zero
  • Occur in last, first, and middle names.
  • Most are zero substituted for O, e.g. J0HNSON, BURT0N
  • Some are pure numeric, e.g. 0, 01
  • Some are names with concatenated numeric, e.g. WAYNE030986, WRIGHT2106

Map zero to O if name contains at least one letter and no digits 1-9
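
The rule above can be sketched in base R. This is a minimal illustration, not the code used in the analysis: `map_zero_to_o` is a hypothetical helper name, and the example values are taken from the output above.

```r
# Hypothetical helper (not from the original analysis): replace the digit 0
# with the letter O, but only in names that contain at least one letter and
# none of the digits 1-9.
map_zero_to_o <- function(name) {
  has_letter   <- grepl("[A-Za-z]", name)   # grepl() returns FALSE for NA
  has_digit_19 <- grepl("[1-9]", name)
  fix <- has_letter & !has_digit_19
  name[fix] <- gsub("0", "O", name[fix], fixed = TRUE)
  name
}

examples <- c("J0HNSON", "M00RE", "0", "01", "WAYNE030986", NA)
fixed <- map_zero_to_o(examples)
print(fixed)
```

Pure numeric values such as 0 and 01, and names with concatenated numerics such as WAYNE030986, are left unchanged so that they can be handled separately.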

3.6.5 Check for one

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "1"))
dim(x)
[1] 20  1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
 [1] "01"              "1"               "491715"          "971"            
 [5] "CARR  111"       "CHASTAIN 11"     "CLARK 111"       "COMER 111"      
 [9] "COX  1V"         "HINES 111"       "HOLLERS  111"    "LATTA 111"      
[13] "LEAK 111"        "MELTON 111"      "MV 5/17/95"      "PEELE 11"       
[17] "SATTERFIELD 111" "SPATCHER 111"    "TUCKER  11"      "WASHINGTON 111" 
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "1"))
dim(x)
[1] 3 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
[1] "DAVID 111" "ELIZABE1H" "ROSE1"    
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "1"))
dim(x)
[1] 163   1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
 [1] "10052004"        "103"             "11"              "111"            
 [5] "1V"              "8017"            "A 111"           "ANDERSON9104576"
 [9] "ANN155"          "B 11"            "B 111"           "C 111"          
[13] "C1010"           "D 11"            "DALE401"         "EDWARDS1801"    
[17] "ELAINE1000"      "EUGENE 11"       "FRANCIS 11"      "FRANKLIN 1V"    
[21] "H 11"            "H 111"           "HODGES 111"      "HOUSTON 11"     
[25] "HOYLE 111"       "J1-TO"           "JAMES 111"       "JONA1"          
[29] "JOYCE701"        "LOUIS7100"       "LYN451"          "LYNN1820"       
[33] "LYNN2513"        "M 111"           "M1"              "MARIE103062"    
[37] "MARION 111"      "MASON 111"       "MICHAEL146"      "N 111"          
[41] "NADINE DOUGLAS1" "OLLIE 111"       "RANDOLPH 111"    "RAY 1."         
[45] "ROYAL 111"       "T 111"           "THOMAS 111"      "VERNON 111"     
[49] "W 111"           "WILLIAM 11"      "WILLIAM 111"     "WILLIAM1"       
[53] "WM 111"          "WRIGHT2106"     
  • 186 names with one
  • Occur in last, first, and middle names.
  • Most are 1 substituted for I in generation suffix, e.g. COX 1V, CARR 111
  • Some are pure numeric, e.g. 01, 971
  • Some are wrongly parsed, e.g. MV 5/17/95
  • Some are names with concatenated numeric, e.g. LYNN2513, WRIGHT2106

Delete generation suffixes where possible
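
A minimal base R sketch of how the digit-typed suffixes might be stripped, assuming the suffix is separated from the name by whitespace. `strip_one_suffix` is a hypothetical helper name, and the pattern covers only the forms seen above (11, 111, 1V).

```r
# Hypothetical helper: strip a trailing generation suffix typed with the
# digit 1 in place of the letter I (11, 111, 1V) when it follows whitespace.
strip_one_suffix <- function(name) {
  sub("[[:space:]]+(111|11|1V)$", "", name)
}

examples <- c("CARR  111", "COX  1V", "TUCKER  11", "ELIZABE1H", "491715", "111")
stripped <- strip_one_suffix(examples)
print(stripped)
```

Values with no preceding name (111), pure numerics (491715), and in-name substitutions (ELIZABE1H) do not match the pattern and are left as they are.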

3.6.6 Check for two

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "2"))
dim(x)
[1] 1 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
[1] "0000000072294"
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "2"))
dim(x)
[1] 1 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
[1] "MICHAEL DEAN 2"
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "2"))
dim(x)
[1] 13  1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
 [1] "10052004"        "205"             "2205"            "328"            
 [5] "4625"            "4932"            "ALEXANDER080572" "B2957"          
 [9] "LYNN1820"        "LYNN2513"        "MARIE103062"     "MIZELLE25248249"
[13] "WRIGHT2106"     
  • 15 names with two
  • Some are pure numeric, e.g. 205, 328
  • Some are names with concatenated numeric, e.g. LYNN1820, WRIGHT2106

3.6.7 Check for three

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "3"))
dim(x)
[1] 1 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
[1] "3"
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "3"))
dim(x)
[1] 0 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
character(0)
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "3"))
dim(x)
[1] 13  1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
 [1] "103"           "328"           "3RD."          "4932"         
 [5] "LEE3708"       "LYNN2513"      "MACK 3RD"      "MARIE103062"  
 [9] "MITCHELL368"   "ROSENBAUM3305" "SANFORD-3"     "SCOTT3450"    
[13] "WAYNE030986"  
  • 14 names with three
  • Some are pure numeric, e.g. 103, 328
  • Some are generation suffixes, e.g. 3RD., MACK 3RD
  • Some are names with concatenated numeric, e.g. LEE3708, SCOTT3450

3.6.8 Check for four

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "4"))
dim(x)
[1] 3 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
[1] "0000000072294" "491715"        "4MCMANUS"     
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "4"))
dim(x)
[1] 1 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
[1] "FR4ANK"
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "4"))
dim(x)
[1] 15  1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
 [1] "10052004"        "4625"            "4932"            "ANDERSON9104576"
 [5] "ANN BURTON47"    "ARGUS 4TH"       "DALE401"         "JAM4S"          
 [9] "LYN451"          "MCREE 4"         "MICHA4EL"        "MICHAEL146"     
[13] "MIZELLE25248249" "SCOTT3450"       "TE4S"           
  • 19 names with four
  • Some are pure numeric, e.g. 4625, 4932
  • Some are generation suffixes, e.g. ARGUS 4TH, MCREE 4
  • Some are names with concatenated numeric, e.g. DALE401, SCOTT3450
  • Some are intrusions in names, e.g. FR4ANK, MICHA4EL

3.6.9 Check for five

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "5"))
dim(x)
[1] 3 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
[1] "491715"     "ALBER5TSON" "MV 5/17/95"
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "5"))
dim(x)
[1] 0 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
character(0)
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "5"))
dim(x)
[1] 17  1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
 [1] "(NMN)5TH"        "10052004"        "205"             "2205"           
 [5] "4625"            "ALEXANDER080572" "ANDERSON9104576" "ANN155"         
 [9] "B2957"           "FINLEY500 SU"    "LUTHER5"         "LYN451"         
[13] "LYNN2513"        "MIZELLE25248249" "ROSENBAUM3305"   "SCOTT3450"      
[17] "W5RAY"          
  • 20 names with five
  • Some are pure numeric, e.g. 205, 2205
  • Some are generation suffixes, e.g. (NMN)5TH
  • Some are names with concatenated numeric, e.g. ANN155, SCOTT3450
  • Some are wrongly parsed, e.g. MV 5/17/95
  • Some are intrusions in names, e.g. W5RAY
  • Some might be substitution of 5 for S, e.g. ALBER5TSON

3.6.10 Check for six

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "6"))
dim(x)
[1] 1 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
[1] "6"
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "6"))
dim(x)
[1] 1 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
[1] "RETT6A"
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "6"))
dim(x)
[1] 7 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
[1] "4625"            "ANDERSON9104576" "MARIE103062"     "MICHAEL146"     
[5] "MITCHELL368"     "WAYNE030986"     "WRIGHT2106"     
  • 9 names with six
  • Some are pure numeric, e.g. 6, 4625
  • Some are names with concatenated numeric, e.g. MICHAEL146, MITCHELL368

3.6.11 Check for seven

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "7"))
dim(x)
[1] 4 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
[1] "0000000072294" "491715"        "971"           "MV 5/17/95"   
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "7"))
dim(x)
[1] 0 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
character(0)
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "7"))
dim(x)
[1] 8 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
[1] "8017"            "ALEXANDER080572" "ANDERSON9104576" "ANN BURTON47"   
[5] "B2957"           "JOYCE701"        "LEE3708"         "LOUIS7100"      
  • 12 names with seven
  • Some are pure numeric, e.g. 491715, 971
  • Some are names with concatenated numeric, e.g. JOYCE701, LOUIS7100
  • Some are wrongly parsed, e.g. MV 5/17/95

3.6.12 Check for eight

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "8"))
dim(x)
[1] 0 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
character(0)
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "8"))
dim(x)
[1] 2 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
[1] "BEA LOUI8" "J8IMMIE"  
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "8"))
dim(x)
[1] 9 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
[1] "328"             "8017"            "ALEXANDER080572" "EDWARDS1801"    
[5] "LEE3708"         "LYNN1820"        "MITCHELL368"     "MIZELLE25248249"
[9] "WAYNE030986"    
  • 11 names with eight
  • Some are pure numeric, e.g. 328, 8017
  • Some are names with concatenated numeric, e.g. LEE3708, LYNN1820
  • Some are intrusions in names, e.g. J8IMMIE
  • Some might be substitution of 8 for SE, e.g. BEA LOUI8

3.6.13 Check for nine

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "9"))
dim(x)
[1] 4 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
[1] "0000000072294" "491715"        "971"           "MV 5/17/95"   
x <- d %>% 
  dplyr::select(first_name) %>%
  dplyr::filter(stringr::str_detect(first_name, "9"))
dim(x)
[1] 0 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(first_name) %>% 
  dplyr::pull(first_name)
character(0)
x <- d %>% 
  dplyr::select(midl_name) %>%
  dplyr::filter(stringr::str_detect(midl_name, "9"))
dim(x)
[1] 6 1
x %>%   
  dplyr::distinct() %>% 
  dplyr::arrange(midl_name) %>% 
  dplyr::pull(midl_name)
[1] "4932"            "ANDERSON9104576" "B2957"           "LO9UIS"         
[5] "MIZELLE25248249" "WAYNE030986"    
  • 10 names with nine
  • Some are pure numeric, e.g. 971, 4932
  • Some are names with concatenated numeric, e.g. ANDERSON9104576, WAYNE030986
  • Some are wrongly parsed, e.g. MV 5/17/95
  • Some are intrusions in names, e.g. LO9UIS
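
The ten per-digit checks above all run the same three queries. As a compact cross-check, the counts could be tabulated in one pass. This is a minimal base R sketch on a toy data frame, not the code used in the analysis: `d_toy` is hypothetical, its values are borrowed from the output above, and the pairing of values into rows is made up.

```r
# Toy stand-in for the voter data frame `d` (hypothetical rows).
d_toy <- data.frame(
  last_name  = c("J0HNSON", "CARR  111", "SMITH", "4MCMANUS"),
  first_name = c("J0HN", "DAVID 111", "FR4ANK", "ANN"),
  midl_name  = c("LO9UIS", "LYNN2513", NA, "M00RE"),
  stringsAsFactors = FALSE
)

digits <- as.character(0:9)
cols   <- c("last_name", "first_name", "midl_name")

# Rows = digits, columns = name fields,
# cells = number of rows whose value contains that digit (NA counts as no match).
digit_counts <- sapply(cols, function(col) {
  sapply(digits, function(dg) sum(grepl(dg, d_toy[[col]], fixed = TRUE)))
})
print(digit_counts)
```

Summing a row of this matrix corresponds to the "names with ..." totals quoted in the bullets, which add the counts from the three name fields.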

3.6.14 Check for whitespace

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "\\s"))
dim(x)
[1] 13637     1
x %>%   
  dplyr::slice_head(n = 100) %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
  [1] "ABD SHAKUR"         "ABD SHAKUR"         "AL HUSSAINA"       
  [4] "ARNOLD DEW"         "BENDER JR"          "DA SILVA"          
  [7] "DA SILVA"           "DA SILVA"           "DE BRADY"          
 [10] "DEL MAURO"          "DEL ROSARIO"        "DES JARDINS"       
 [13] "DI LORENZO"         "DU BOIS"            "HOLLERS  111"      
 [16] "KROMIS BRESNIHAN"   "LA MOTTE"           "LAMBERT JR"        
 [19] "LE BLANC"           "LE FEVER"           "LE MAY"            
 [22] "MAC CRINDLE"        "MAC DONALD"         "MAC DOWELL"        
 [25] "MAC DOWELL"         "MC ANIFF"           "MC ANIFF"          
 [28] "MC CADEN"           "MC CADEN"           "MC CADEN"          
 [31] "MC CADEN"           "MC CADEN"           "MC CADEN"          
 [34] "MC CADEN"           "MC COY"             "MC COY"            
 [37] "MC COY"             "MC COY"             "MC CRAY"           
 [40] "MC GARR"            "MC GHEE"            "MC GHEE"           
 [43] "MC GHEE"            "MC GUIRE"           "MC MANNEN"         
 [46] "MC MULLEN"          "MC NAIR"            "MCMILLIAN   (MUMFO"
 [49] "MCQUEEN   (MORRISE" "MILLS- KHARBAT"     "NCT IS WRONG. SENT"
 [52] "O BRIEN"            "O HARA"             "O NEAL"            
 [55] "O NEAL"             "PARISH    (RAMON)"  "REDFEARN- SHELTON" 
 [58] "ST CLAIR"           "ST CLAIR"           "ST CLAIR"          
 [61] "ST CLAIR"           "ST CLAIR"           "ST CLAIR"          
 [64] "ST CLAIR"           "ST LOUIS"           "ST ONGE"           
 [67] "ST PIERR"           "ST SING"            "ST SING"           
 [70] "ST SING"            "SYKES (BRICKHOUSE)" "TIPTON- BARNARD"   
 [73] "VAN BALEN"          "VAN BUSKIRK"        "VAN DEVENTER"      
 [76] "VAN DONSEL"         "VAN DORPE"          "VAN DYKE"          
 [79] "VAN DYKE"           "VAN ETTEN"          "VAN HORN"          
 [82] "VAN HORN"           "VAN HORN"           "VAN HORN"          
 [85] "VAN LOTON"          "VAN MEIR"           "VAN SCHOLK"        
 [88] "VAN SUTPHIN"        "VAN ZANDLE"         "VAN ZANDLE"        
 [91] "VANDER STOKKER"     "VON BIBERSTEIN"     "VON BIBERSTEIN"    
 [94] "VON BIBERSTEIN"     "VON BIBERSTEIN"     "VON BIBERSTEIN"    
 [97] "WATTS ST PIERREE"   "WHITFIELD KAY M"    "YELLOW ROBE"       
[100] "YELLOW ROBE"       
  • 13,637 rows with whitespace
  • Some whitespace is because of prefixes, e.g. DE COSTA, VAN DYKE
  • Some whitespace is instead of a hyphen, e.g. BROWN MAY, JONES MBIYA
  • Some whitespace is incorrectly inserted, e.g. LI NDSEY
  • Some whitespace is probably variable between people, e.g. MC ADAMS, MC INTOSH
  • Some whitespace is instead of a single quote, e.g. O KELLY, O NEAL

Map whitespace to empty string
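
A minimal base R sketch of this mapping, using values from the output above. `remove_whitespace` is a hypothetical helper name, not a function from the analysis.

```r
# Hypothetical helper: delete every whitespace character from a name.
remove_whitespace <- function(name) gsub("[[:space:]]+", "", name)

examples  <- c("MC COY", "MCCOY", "VAN DYKE", "HOLLERS  111", "O NEAL")
collapsed <- remove_whitespace(examples)
print(collapsed)
```

After the mapping, MC COY and MCCOY compare equal, which is the point of the step. Note that HOLLERS  111 becomes HOLLERS111, so any handling of generation suffixes would presumably need to run before whitespace is removed.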

3.6.15 Check for hyphens

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "-"))
dim(x)
[1] 34325     1
x %>%   
  dplyr::slice_head(n = 100) %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
  [1] "AB-HUGH"           "AB-HUGH"           "ABDUL-GHAFFAR"    
  [4] "ABDUL-GHAFFAR"     "ABDUL-KARRIEM"     "ABDUL-RABB"       
  [7] "ABDUL-RAHIM"       "ABDUL-RAHIN"       "ABDUL-RAHMAN"     
 [10] "ABDUL-SALAAM"      "ABDUL-SALAM"       "ABDUL-WAHID"      
 [13] "ABDUR-RAHIM"       "ABDUR-RAHMAN"      "ABU-DAMES"        
 [16] "ABU-SABA"          "ABU-SABA"          "ABU-SABA"         
 [19] "ADAMS-CASKIE"      "ADAMS-MYERS"       "AFRICA-FLOYD"     
 [22] "AL-AWAR"           "AL-AWAR"           "AL-AWAR"          
 [25] "AL-KURDI"          "AL-SAADI"          "AL-SAADI"         
 [28] "ALBERT-KEULAN"     "ALSTON-EATMON"     "ANDERSON-TESH"    
 [31] "APPLEWHITE-LEWIS"  "ARDITO-BARLETTA"   "ARMSTRONG-VANN"   
 [34] "ARTHUR-CORNETT"    "ASKINS-MYRICK"     "AWTREY-KIRKMAN"   
 [37] "BAILEY-BROOKS"     "BARNARD-BAILEY"    "BENNETT-CLOWNEY"  
 [40] "BENTLEY-HALE"      "BIBB-FREEMAN"      "BLAKE-HASKINS"    
 [43] "BLEKFELD-SZTRAKY"  "BLEVINS-SPRINKLE"  "BLUE-SWANN"       
 [46] "BRADY-WILSON"      "BROWN-CORNELIUS"   "BRUCE-ROSS"       
 [49] "BUCKLEY-MOORE"     "CLARK-BARKER"      "CLAUDIO-DIAZ"     
 [52] "CLAUDIO-DIAZ"      "CLAUDIO-DIAZ"      "CLAUDIO-DIAZ"     
 [55] "COLE-MORGAN"       "CROWELL-SMITH"     "DAVIS-BOYD"       
 [58] "DAVIS-PARKER"      "DAVIS-ROBINSON"    "DUFFER-LEECHFORD" 
 [61] "EATON-ALSTON"      "ELLIS-WALLACE"     "ENGEL-BAKER"      
 [64] "GILLIS-HENDELL"    "GORDON-WICKER"     "GREEN-HOLLEY"     
 [67] "GUPTA-THOMAS"      "HARGETT-LILLY"     "HIATT-CRIBBS"     
 [70] "JONES-ALEXANDER"   "JONES-SUTTON"      "KELLER-HULL"      
 [73] "KOSKI-PONTON"      "KUCERA-HOFFMANN"   "LAWS-GRIFFIN"     
 [76] "LEARY-SMITH"       "LIDE-GRANT"        "LITTON-MCKENZIE"  
 [79] "LOCKLEAR-CASEY"    "LOCKLEAR-CRABTREE" "MANESS-LITTLE"    
 [82] "MAYNOR-BOWEN"      "MILLS- KHARBAT"    "MURPHY-GRAY"      
 [85] "PARKER-LOWE"       "PARRA-ASH"         "POOLE-JENKINS"    
 [88] "POPISH-SMITH"      "RAY-LEAZER"        "REDFEARN- SHELTON"
 [91] "RIDDICK-HARRELL"   "RIVERA-MONTORO"    "SEVORES-AMMONS"   
 [94] "SORRELLS-COOPER"   "STEPHENS-HORTON"   "TIPTON- BARNARD"  
 [97] "TOMBLIN-WELLMAN"   "WALLIS-JOHNSON"    "WATKINS-AKERS"    
[100] "WHITAKER-LINDSAY" 
  • 34,325 rows with hyphens
  • Look like legitimately hyphenated names

Map hyphen to empty string

3.6.16 Check for single quotes

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "'"))
dim(x)
[1] 9712    1
x %>%   
  dplyr::distinct() %>% 
  dplyr::slice_head(n = 100) %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
  [1] "BOURR'E"           "BOVE'"             "D'ALPHE"          
  [4] "D'AMBROSIO"        "D'AMICO"           "D'ANGELO"         
  [7] "D'ANGIO"           "D'ANNUNZIO"        "D'ANTIGNAC"       
 [10] "D'ARCO"            "D'ARMOND"          "D'ARVILLE"        
 [13] "D'ASCOLI"          "D'AUGUSTA"         "D'AURIA"          
 [16] "D'AUTRECHY"        "D'AVANZO"          "D'EMPAIRE"        
 [19] "D'ERCOLE"          "D'HEMECOURT"       "D'IGNAZIO"        
 [22] "D'INDIA"           "D'ONOFRIO"         "D'SANT"           
 [25] "DEBELL-O'NEAL"     "DEL RE'"           "DELL'OSSO"        
 [28] "DUARTE'"           "L'ETOILE"          "L'HUILLIER"       
 [31] "LACHARITE'-OTWELL" "O' NEAL"           "O'BANION"         
 [34] "O'BANNON"          "O'BERRY"           "O'BRIAN"          
 [37] "O'BRIANT"          "O'BRIEN"           "O'BRYAN"          
 [40] "O'BRYANT"          "O'BRYON"           "O'BYRNE"          
 [43] "O'CARROLL"         "O'CONNEL"          "O'CONNELL"        
 [46] "O'CONNER"          "O'CONNOR"          "O'CONWELL"        
 [49] "O'DANIEL"          "O'DEA"             "O'DEAR"           
 [52] "O'DEAR BROOKS"     "O'DELL"            "O'DOM"            
 [55] "O'DONALD"          "O'DONNEL"          "O'DONNELL"        
 [58] "O'DRISCOLL"        "O'FARRELL"         "O'FERRELL"        
 [61] "O'GARA"            "O'GEARY"           "O'GRADY"          
 [64] "O'GUIN"            "O'GWYNN"           "O'HARA"           
 [67] "O'HERN"            "O'KANE"            "O'KEEFE"          
 [70] "O'KELLEY"          "O'KELLY"           "O'KONEK"          
 [73] "O'LAUGHLIN"        "O'LEARY"           "O'MAHONY"         
 [76] "O'MARA"            "O'NEAL"            "O'NEAL-BIGGS"     
 [79] "O'NEAL-CLEMENTS"   "O'NEAL-WRIGHT"     "O'NEIL"           
 [82] "O'NEILL"           "O'PHARROW"         "O'QUIN"           
 [85] "O'QUINN"           "O'REAR"            "O'REILLY"         
 [88] "O'RILEY"           "O'RORK"            "O'ROUKE"          
 [91] "O'ROURKE"          "O'SHAUGHNESSY"     "O'SHEA"           
 [94] "O'SHIELD"          "O'SHIELDS"         "O'STEEN"          
 [97] "O'SULLIVAN"        "O'TOOLE"           "O'TUEL"           
[100] "SOLLE'"           
  • 9,712 rows with single quotes
  • Most look like correct names, e.g. O’NEAL, D’AGOSTINO
  • Some are suspect, e.g. BONE’, BOVA’

Map single quote to empty string

3.6.17 Check for double quotes

d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "\""))
# A tibble: 1 x 1
  last_name
  <chr>    
1 "LA\"BEE"
  • 1 name with double quotes
  • The backslash shown in "LA\"BEE" is most likely R's escape for the embedded double quote when the value is printed, not a character stored in the data
  • That name probably should have been LA’BEE

Map all double quotes to single quotes
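
When a value prints as "LA\"BEE", it can help to check whether the backslash is stored in the data or only appears in the printed form. A minimal base R sketch, using a literal string as a stand-in for the value in `d`:

```r
name <- "LA\"BEE"   # R source needs the backslash to embed a double quote

print(name)         # print() shows the quoted, escaped form
cat(name, "\n")     # cat() writes the raw characters

n_chars       <- nchar(name)                      # number of stored characters
has_backslash <- grepl("\\", name, fixed = TRUE)  # is a backslash stored?
```

If the stored value in `d` behaves like this literal, the backslash belongs to the printed representation only. The same check would apply to values that print with a doubled backslash.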

3.6.18 Check for other characters (1)

x <- d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "[^-\\s'\"a-zA-Z]"))
dim(x)
[1] 294   1
x %>%   
  dplyr::distinct() %>% 
  dplyr::slice_head(n = 100) %>% 
  dplyr::arrange(last_name) %>% 
  dplyr::pull(last_name)
  [1] ";PGEMAN"              "01"                   "3"                   
  [4] "491715"               "4MCMANUS"             "AMATO,KATHERINE,M"   
  [7] "AMIDON,PETER,LEVENT"  "AN0Y0"                "BAKER   (MCFADYEN)"  
 [10] "BAREFOOT   (RHINE)"   "BELL,MITCHELL THOMAS" "BEST,SYDNEY,ALLISON" 
 [13] "BINGHAM JR."          "BOYD,ALLEN AUBREY,II" "BRICE.MICHAEL ARTHUR"
 [16] "BRINKLEY/BAGGS"       "BRITTAIN/SPRINKLE"    "BROWN,FREDERIC CHEST"
 [19] "BROWN,ROBERT EDWARD," "BUCHANAN,SAMMY JOE,J" "BUNTON,RAYMOND AVNEY"
 [22] "BURGESS,WINFRED LEE," "BURNETTE,TOMMY WILLI" "BURT0N"              
 [25] "BURWELL JR."          "BYRD/ROBERTS"         "CARR  111"           
 [28] "CARR,WENDELL,H JR"    "CARSON   (WADE)"      "CASH/GODWIN"         
 [31] "CATHEY,LONNIE,JR"     "CHASTAIN 11"          "COLLINS (SISTER)"    
 [34] "COMER 111"            "COTHERN  (BLAKE)"     "COX  1V"             
 [37] "EDENS   (ARCHAMBAU"   "EVANS  (ABBOTT)"      "FEE (SISTER)"        
 [40] "FORTNER,II"           "FOSTER   (KING)"      "GALL0WAY"            
 [43] "GARNER/MCGRAW"        "HINES 111"            "HOLLERS  111"        
 [46] "HUDSON  (HALL)"       "J0HNSON"              "JORDAN-R0BERTS"      
 [49] "KEAT0N"               "KINLAW  (GUIN)"       "LAIL/OXENTINE"       
 [52] "LEAK 111"             "LYTLE/FORNEY"         "MCC0Y"               
 [55] "MCCLAIN (SISTER)"     "MCDONOUGH (SISTER)"   "MCMILLIAN   (MUMFO"  
 [58] "MCQUEEN   (MORRISE"   "MOCCIA  (SMITH)"      "MOORING,MOLLY"       
 [61] "MORRIS/BLOOM"         "MV 5/17/95"           "NCT IS WRONG. SENT"  
 [64] "NICHOLS  (NORTON)"    "NICHOLS/BROWN"        "O;NEAL"              
 [67] "O`BRIANT"             "PALMER(BRIGGS)"       "PARISH    (RAMON)"   
 [70] "PUCKETT`"             "RAMSEY/DOBERT"        "REYN0LDS"            
 [73] "RHONEY/PETERS"        "RIDGWAY;"             "ROGERS,JR."          
 [76] "SIDI/HIDA"            "SIMPS0N"              "SMELT/PEARSON"       
 [79] "SMITH/COOPER"         "SPATCHER 111"         "ST. CLAIR"           
 [82] "ST. DENIS"            "ST. GEORGE"           "ST. LAWRENCE"        
 [85] "ST.CLAIR"             "ST.GEORGE"            "ST.GERMAINE"         
 [88] "STUTLER/JAGGERS"      "SWYGERT/SMITH"        "SYKES (BRICKHOUSE)"  
 [91] "TRIVETTE JR."         "TUCKER  11"           "VALKENAAR   ."       
 [94] "WATERS/CRUZ"          "WEATHERINGTON,III"    "WILSON JR."          
 [97] "WO0DARD"              "WOODARD/YANTES"       "WOODARD`"            
[100] "YARBOR0"             
  • 294 rows with other characters
  • Characters seen include: zero, one, five, period, comma, asterisk, slash, backslash, back-tick, tilde, percent, underscore

Look at those in more detail.
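
One way to see which characters are involved, before checking them one at a time, is to extract every character outside the allowed set and tabulate it. This is a minimal base R sketch on a handful of values from the output above, not the code used in the analysis.

```r
examples <- c("BURT0N", "ST. CLAIR", "O`BRIANT", "BYRD/ROBERTS",
              "ROGERS,JR.", "O'NEAL-BIGGS", "SMITH")

# Characters that are not letters, whitespace, hyphen, single or double quote
# (the same exclusion set as the check above).
others <- unlist(regmatches(
  examples,
  gregexpr("[^-[:space:]'\"A-Za-z]", examples)
))

other_counts <- table(others)
print(other_counts)
```

Applied to `d$last_name`, the resulting table would list each unexpected character with its frequency, which gives the list of characters to examine in the following subsections.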

3.6.19 Check for period

d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "\\."))
# A tibble: 44 x 1
   last_name         
   <chr>             
 1 NCT IS WRONG. SENT
 2 ROGERS,JR.        
 3 VALKENAAR   .     
 4 ST.GEORGE         
 5 ST. GEORGE        
 6 BINGHAM JR.       
 7 ST. LAWRENCE      
 8 WILSON JR.        
 9 TRIVETTE JR.      
10 ST. CLAIR         
# … with 34 more rows
  • 44 rows with period
  • Most are legitimate abbreviation of SAINT although spacing is inconsistent, e.g. ST.JOHN, ST. JOHN
  • Some are legitimate abbreviation of Junior, which should be in the name_sufx_cd field.

Map period to empty string

Move suffix to suffix field

3.6.20 Check for comma

d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, ","))
# A tibble: 63 x 1
   last_name           
   <chr>               
 1 AMATO,KATHERINE,M   
 2 AMIDON,PETER,LEVENT 
 3 ROGERS,JR.          
 4 BELL,MITCHELL THOMAS
 5 BEST,SYDNEY,ALLISON 
 6 WEATHERINGTON,III   
 7 BOYD,ALLEN AUBREY,II
 8 BROWN,FREDERIC CHEST
 9 FORTNER,II          
10 BROWN,ROBERT EDWARD,
# … with 53 more rows
  • 63 rows with comma
  • Some are a suffix incorrectly included in last_name, e.g. FORTNER,II and WEATHERINGTON,III
  • Some are a whole name packed into last_name, e.g. AMATO,KATHERINE,M

Map comma to empty string

Move suffix to suffix field
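
A minimal sketch of moving a trailing suffix out of last_name, covering both the period and comma cases above. `split_suffix` is a hypothetical helper, and the list of suffix tokens (JR, SR, II, III, IV) is an assumption chosen for illustration. The sketch does not consider an existing value in name_sufx_cd.

```r
# Hypothetical helper: split a trailing JR/SR/II/III/IV (optionally followed by
# a period, and preceded by spaces or commas) out of last_name.
split_suffix <- function(last_name) {
  pattern  <- "^(.*?)[ ,]+(JR|SR|III|II|IV)\\.?$"
  has_sufx <- grepl(pattern, last_name, perl = TRUE)
  data.frame(
    last_name    = ifelse(has_sufx, sub(pattern, "\\1", last_name, perl = TRUE), last_name),
    name_sufx_cd = ifelse(has_sufx, sub(pattern, "\\2", last_name, perl = TRUE), NA_character_),
    stringsAsFactors = FALSE
  )
}

examples <- c("BINGHAM JR.", "ROGERS,JR.", "FORTNER,II",
              "WEATHERINGTON,III", "ST. CLAIR", "AMATO,KATHERINE,M")
out <- split_suffix(examples)
print(out)
```

Names such as ST. CLAIR and AMATO,KATHERINE,M do not end in one of the assumed suffix tokens, so they pass through unchanged and would need separate treatment.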

3.6.21 Check for asterisk

d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "\\*"))
# A tibble: 7 x 1
  last_name
  <chr>    
1 O*TOOLE  
2 O*TOOLE  
3 O*NEAL   
4 O*MASTERS
5 D*AMICO  
6 D*AMICO  
7 O*BRIEN  
  • 7 names with asterisk
  • Asterisk substituted for single quote, e.g. O*MASTERS, D*AMICO

Map asterisk to empty string

3.6.22 Check for slash

d %>% 
  dplyr::select(last_name, first_name, sex) %>%
  dplyr::filter(stringr::str_detect(last_name, "/"))
# A tibble: 46 x 3
   last_name         first_name sex   
   <chr>             <chr>      <chr> 
 1 GARNER/MCGRAW     JOANN      FEMALE
 2 RHONEY/PETERS     DONNA      FEMALE
 3 MV 5/17/95        <NA>       MALE  
 4 SIDI/HIDA         DEBORAH    FEMALE
 5 STUTLER/JAGGERS   MELANIE    FEMALE
 6 MORRIS/BLOOM      TERESA     FEMALE
 7 BRINKLEY/BAGGS    MICHELLE   FEMALE
 8 RAMSEY/DOBERT     AMY        FEMALE
 9 WATERS/CRUZ       ELIZABETH  FEMALE
10 BRITTAIN/SPRINKLE SANDRA     FEMALE
# … with 36 more rows
  • 46 rows with slash
  • Mostly being used equivalently to hyphen. (The fact that nearly all of the rows shown are female suggests they might be women hyphenating their names on marriage.) The exception shown, MV 5/17/95, is a wrongly parsed date.

Map slash to empty string

3.6.23 Check for backslash

d %>% 
  dplyr::select(last_name, first_name, sex) %>%
  dplyr::filter(stringr::str_detect(last_name, "\\\\"))
# A tibble: 4 x 3
  last_name    first_name sex   
  <chr>        <chr>      <chr> 
1 "PUTNAM\\"   TAMARA     FEMALE
2 "STRTHEIT\\" LOLA       FEMALE
3 "BUFFKIN\\"  WESLEY     MALE  
4 "GOSHEN\\"   DIXIE      FEMALE
  • 4 rows with backslash
  • No obvious reason for inclusion

Map backslash to empty string

3.6.24 Check for back-tick

d %>% 
  dplyr::select(last_name, first_name, sex) %>%
  dplyr::filter(stringr::str_detect(last_name, "`"))
# A tibble: 10 x 3
   last_name first_name sex   
   <chr>     <chr>      <chr> 
 1 O`BRIANT  DIANE      FEMALE
 2 O`BRIANT  WILLIAM    MALE  
 3 WOODARD`  JASON      MALE  
 4 PUCKETT`  LEANDRA    FEMALE
 5 BRYANT`   WILLIAM    MALE  
 6 GODWIN`   PATRICIA   FEMALE
 7 MORRISON` HAZEL      FEMALE
 8 BOYLES`   LINDA      FEMALE
 9 HARRISON` TRACI      FEMALE
10 CASEY`    LONNIE     MALE  
  • 10 rows with back-tick
  • No obvious reason for inclusion

Map back-tick to empty string

3.6.25 Check for tilde

d %>% 
  dplyr::select(last_name, first_name, sex) %>%
  dplyr::filter(stringr::str_detect(last_name, "~"))
# A tibble: 1 x 3
  last_name      first_name sex   
  <chr>          <chr>      <chr> 
1 O~CONNOR-LEWIS BELINDA    FEMALE
  • 1 name with tilde
  • Being used equivalently to single quote

Map tilde to empty string

3.6.26 Check for underscore

d %>% 
  dplyr::select(last_name, first_name, sex) %>%
  dplyr::filter(stringr::str_detect(last_name, "_"))
# A tibble: 1 x 3
  last_name      first_name sex   
  <chr>          <chr>      <chr> 
1 SOLARZ_VOJDANI JENNIFER   FEMALE
  • 1 name with underscore
  • Being used equivalently to hyphen

Map underscore to empty string

3.6.27 Check for percent

d %>% 
  dplyr::select(last_name, first_name, sex) %>%
  dplyr::filter(stringr::str_detect(last_name, "%"))
# A tibble: 1 x 3
  last_name     first_name sex   
  <chr>         <chr>      <chr> 
1 SCHERM%MARTIN WYATT      FEMALE
  • 1 name with percent
  • Being used equivalently to hyphen

Map percent to empty string
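
The "map to empty string" actions proposed for the punctuation characters above could be combined into one step. This is a minimal base R sketch, not the code used in the analysis. `strip_punct` is a hypothetical helper, and the character set is assumed from the actions listed in the preceding subsections (hyphen, single quote, period, comma, asterisk, slash, backslash, back-tick, tilde, underscore, percent).

```r
# Hypothetical helper: delete the punctuation characters that the checks above
# propose mapping to the empty string. perl = TRUE so that \\ inside the
# character class is read as one literal backslash.
strip_punct <- function(name) {
  gsub("[-'.,*/\\\\`~_%]", "", name, perl = TRUE)
}

examples <- c("O'NEAL", "O*NEAL", "O~CONNOR-LEWIS", "SOLARZ_VOJDANI",
              "SCHERM%MARTIN", "PUTNAM\\", "ST.CLAIR", "BYRD/ROBERTS", "PUCKETT`")
cleaned <- strip_punct(examples)
print(cleaned)
```

With this mapping O'NEAL and O*NEAL both become ONEAL, so the variant spellings collapse to a single value. Whitespace and double quotes are not handled here because they have their own actions above.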

3.6.28 Check for other characters (2)

d %>% 
  dplyr::select(last_name) %>%
  dplyr::filter(stringr::str_detect(last_name, "[^-\\s'\"a-zA-Z015\\.,\\*/\\\\`~_%]"))
# A tibble: 37 x 1
   last_name         
   <chr>             
 1 ;PGEMAN           
 2 MCMILLIAN   (MUMFO
 3 MCQUEEN   (MORRISE
 4 SYKES (BRICKHOUSE)
 5 PARISH    (RAMON) 
 6 BAREFOOT   (RHINE)
 7 MV 5/17/95        
 8 HUDSON  (HALL)    
 9 FOSTER   (KING)   
10 KINLAW  (GUIN)    
# … with 27 more rows

UP TO HERE

Look at those in more detail.

Look at frequencies of names.

d %>% 
  dplyr::select(last_name) %>% 
  dplyr::count(last_name, sort = TRUE)
# A tibble: 269,313 x 2
   last_name      n
   <chr>      <int>
 1 SMITH     105215
 2 WILLIAMS   74940
 3 JOHNSON    70103
 4 JONES      69712
 5 BROWN      57265
 6 DAVIS      54381
 7 MOORE      40564
 8 MILLER     36539
 9 WILSON     35738
10 TAYLOR     33519
# … with 269,303 more rows

3.7 name_sufx_cd

name_sufx_cd: Voter name suffix

d %>% dplyr::select(name_sufx_cd) %>% skimr::skim()
Table 3.2: Data summary
Name Piped data
Number of rows 8003293
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
name_sufx_cd     7561920           0.06    1    3      0       222           0
table(d$name_sufx_cd, useNA = "ifany")

      ?       '     (GE     (II     (JR     (SR      \\       `       0     040 
      2       2       1       1       4       1       2      20       3       1 
    070     072      08       1     106      11     111     134      15     181 
      1       1       1       7       1     101     241       1       1       1 
     1V       2     2ND       3     346      39     3RD       5     5TH      77 
      5       4       1       1       1       1      14       1       2       1 
      8     8TH       9       A     AJR     AKB     ALB     ALM     ANN     ARK 
      1       1       1       1       1       1       1       1       6       1 
    ART     ARV       B     BAL     BAS     BAU     BEA     BEL     BEN     BOU 
      1       1       6       1       1       1       2       1       1       1 
    BRA     BRI     BRO     BUC     BUN       C      C.     CAM     CHA     CLA 
      1       3       1       1       1      10       1       1       1       1 
    COY     CRA     CUB     CUM     CUT       D     DAN     DAV     DIC     DIG 
      2       1       1       1       1       6       3       1       1       1 
     DO     DOR     DOU     DOV      DR     DR.       E     EDW     ELE     ELI 
      3       1       1       1       1       4       5       1       1       1 
    ELS     ETT     EWA      EY       F     F M     FAU     FOR     FRE       G 
      1       1       1       1       7       1       1       2       2       4 
    GLE     GRE     GUY       H     HAM     HIL     HOG     HOO     HUS       I 
      1       1       1       3       1       1       1       2       1     566 
     II     II.     III     IIL     ILI      IN     ING     IRM     ITH      IV 
  26023       3   56928       1       1       2       1       1       1    6955 
    IV.      IX       J     JAC     JAM      JD     JEN     JOH     JON     JOS 
      2       1      17       1       1       4       1       1       1       2 
     jr      JR     JR,     Jr.     JR.       K     KAP     KEN     KIN     KIT 
      1  295262       1       2    2832       4       1       1       1       1 
      L     LAR     LEE     LEN     LES     LEW     LIN      LL     LLL     LOC 
      8       1       2       1       1       1       1       3       2       1 
    LOU     LYN       M     M D     MAC     MAE     MAT     MCK     MCQ     MCR 
      2       1      11       1       1       1       1       1       1       1 
     MD     MMO     MOO     MOR      MR     MR.     MRS      MS     MS.     MUR 
      6       1       1       1      11      17     123       6      18       1 
      N     NGT     NOC     NON     NOR      NS       O     O'S      OD     OLI 
      3       1       1       1       1       1       2       1       2       1 
     ON     ONG      OV       P     PAU     PET     PHE     PIL     PLA     POP 
      1       1       1       2       1       1       1       1       1       1 
      Q       R     RAY     REB     REE     REV     ROB     ROD     ROY       S 
      3      10       1       1       1      10       2       1       1       5 
    SAM     SCO     SMI     SOR      sr      SR     Sr.     SR.     STA     STE 
      1       2       1       1       1   50917       3     562       2       1 
    SUE     SUM     SWA       T      TA     TOB     TWA     UNK       V     VAN 
      1       1       1       2       1       1       1       1     345       1 
    VER      VI     VII     VIR     VOS       W     WAL     WAR     WIL     WOL 
      1      44      14       1       1       7       1       1       2       1 
      X       Y    <NA> 
      1       1 7561920 

4 Clean name variables

The aggregated cleaning suggestions are:

Name cleaning suggestions
Issue               last_name  first_name  midl_name  Action
Missing                   122         254    553,015  Exclude record if first or last name missing
Lower case letters         50          24        169  Map all letters to upper case
Digits                     90          81        299  Map digits to empty string if not otherwise mapped
Zero                       67          73        130  Map zero to O if name contains at least one letter and no digits 1-9
One                        20           3        163
Two                         1           1         13
Three                       1           0         13
Four                        3           1         15
Five                        3           0         17
Six                         1           1          7
Seven                       4           0          8
Eight                       0           2          9
Nine                        4           0          6
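
The four actions listed so far could be combined roughly as follows. This is a sketch under stated assumptions, not the cleaning code used in the analysis: clean_name() is a hypothetical helper, toy is made-up data, and "missing" is taken to mean NA.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical helper applying the upper-case, zero, and digit rules to one name vector
clean_name <- function(x) {
  # Map all letters to upper case
  x <- stringr::str_to_upper(x)
  # Map zero to O only if the name has at least one letter and no digits 1-9
  zero_as_o <- stringr::str_detect(x, "[A-Z]") & !stringr::str_detect(x, "[1-9]")
  x <- dplyr::if_else(zero_as_o, stringr::str_replace_all(x, "0", "O"), x)
  # Map any digits not otherwise mapped to the empty string
  stringr::str_remove_all(x, "[0-9]")
}

# Made-up records standing in for d
toy <- tibble::tibble(
  last_name  = c("Smith", "J0NES", "SM1TH0", NA),
  first_name = c("Ann", "B0B", "LEE2", "SUE")
)

cleaned <- toy %>%
  # Exclude record if first or last name is missing (assumed to mean NA)
  dplyr::filter(!is.na(last_name), !is.na(first_name)) %>%
  dplyr::mutate(dplyr::across(c(last_name, first_name), clean_name))

cleaned
```

The zero rule runs before the digit removal so that a zero standing in for the letter O is recovered rather than deleted.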
knitr::knit_exit()