Last updated: 2020-12-23
Knit directory:
fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version c6390cc. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: data/VR_20051125.txt.xz
Ignored: data/test.txt
Ignored: data/test.txt.xz
Ignored: output/d.fst
Ignored: renv/library/
Ignored: renv/staging/
Untracked files:
Untracked: data/layout_VR_Snapshot.txt
Unstaged changes:
Modified: .gitignore
Modified: data/.gitignore
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01_get_check_data.Rmd) and HTML (docs/01_get_check_data.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | c6390cc | Ross Gayler | 2020-12-23 | wflow_publish("analysis/*.Rmd") |
Rmd | 01b669c | Ross Gayler | 2020-12-10 | Build site. |
Rmd | bbb7d9d | Ross Gayler | 2020-12-07 | End of day |
Rmd | babb874 | Ross Gayler | 2020-12-06 | End of day |
library(here)
here() starts at /home/ross/RG/projects/academic/entity_resolution/fa_sim_cal_TOP/fa_sim_cal
library(magrittr)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(stringr)
library(vroom)
library(skimr)
library(knitr)
Read the data, characterise it to understand it, and check for possible gotchas.
This project uses historical voter registration data from the North Carolina State Board of Elections. This information is made publicly available in accordance with North Carolina state law. The Voter Registration Data page links to a folder of Voter Registration snapshots, which contains the snapshot data files and a metadata file describing the layout of the snapshot data files. At the time of writing the snapshot files cover the years 2005 to 2020 with at least one snapshot per year. The files are ZIP compressed and relatively large, with the smallest being 572 MB after compression.
The snapshots contain many columns that are irrelevant to this project and/or prohibited under Australian privacy law (e.g. political affiliation, race). We initially read all the columns, because that may help in debugging the inevitable problems reading the data. Later the data set will be restricted to the essential columns for the project.
We use only one snapshot file (VR_Snapshot_20051125.zip) because this project does not investigate linkage of records across time. We chose the oldest snapshot (2005) because it is the smallest and the contents are the most out of date, minimising the current information made available. Note that this project will not generate any information that is not already directly, publicly available from NCSBE.
The snapshot ZIP file was downloaded, uncompressed (5.7 GB), then compressed in XZ format to minimise the size. The compressed snapshot file and the metadata file are stored in the data directory.
raw_file <- here::here("data", "VR_20051125.txt.xz") # raw input file
The cleaned data is stored as an fst format file in the output directory.
d_fst <- here::here("output", "d.fst") # temporary data file
clean_fst <- here::here("output", "clean.fst") # parsed and cleaned data as a dataframe
The data is tab-separated, not fixed-width as you might reasonably think from reading the metadata. The field widths (interpreted as maximum lengths) in the metadata are not accurate. Some fields contain values longer than the stated width.
Inspection of the raw data shows that the character fields are unquoted. However, at least one character value contains a double-quote character, which has the potential to confuse the parsing if it is looking for quoted values.
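A minimal illustration of why quote handling matters for unquoted fields that contain a stray double quote. This is a sketch on invented toy data using base R's read.delim() rather than vroom; the values and column names are hypothetical.

```r
# Toy tab-separated input: the first name value contains a stray double quote.
txt <- "id\tname\n1\tO\"BRIEN\n2\tSMITH\n"

# With quoting disabled, the double quote is treated as an ordinary character.
no_quote <- read.delim(text = txt, quote = "", stringsAsFactors = FALSE)
print(no_quote)

# With quoting enabled, the unbalanced quote may swallow later fields or rows.
# The exact behaviour depends on the parser, so no output is asserted here.
with_quote <- try(
  suppressWarnings(
    read.delim(text = txt, quote = "\"", stringsAsFactors = FALSE)
  ),
  silent = TRUE
)
```

Disabling quoting trades away support for genuinely quoted fields, which is acceptable here because the raw data does not quote its character fields.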
d <- vroom::vroom( #read raw data; let vroom guess the field types
raw_file,
delim = "\t", # assume that fields are *only* delimited by tabs
col_names = TRUE, # use the column names on the first line of data
na = "", # missing fields are empty string or whitespace only (see trim_ws argument)
quote = "", # don't allow for quoted strings
comment = "", # don't allow for comments
trim_ws = TRUE, # trim leading and trailing whitespace
escape_double = FALSE, # assume no escaped quotes
escape_backslash = FALSE # assume no escaped backslashes
)
fst::write_fst(d, d_fst, compress = 100) # save data frame (cheap-skate caching)
d <- fst::read_fst(d_fst) %>% tibble::as_tibble() # get cached data
dim(d)
[1] 8003293 90
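The write-then-read caching above could be made conditional, so that the slow parse only runs when the cache file is missing. The following is a minimal sketch of that pattern, not the approach used in this analysis: read_cached() and build_toy() are hypothetical names, and saveRDS()/readRDS() stand in for fst::write_fst()/fst::read_fst() so the example runs with base R only.

```r
# Hypothetical helper: call `build()` only when the cache file is missing.
# saveRDS()/readRDS() are stand-ins for fst::write_fst()/fst::read_fst().
read_cached <- function(path, build) {
  if (!file.exists(path)) {
    saveRDS(build(), path)
  }
  readRDS(path)
}

cache_file <- tempfile(fileext = ".rds")
n_builds <- 0 # counts how often the expensive step actually runs

build_toy <- function() {
  n_builds <<- n_builds + 1
  data.frame(
    county_id = c(18, 7),
    voter_reg_num = c("0", "000000000000"),
    stringsAsFactors = FALSE
  )
}

d1 <- read_cached(cache_file, build_toy) # builds the data and writes the cache
d2 <- read_cached(cache_file, build_toy) # reads the cache; no rebuild
```

The cache must be deleted by hand whenever the raw file or the parsing arguments change, because the helper only checks that the file exists.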
Take a very quick look at everything then concentrate on the columns that have a chance of being useful.
glimpse(d)
Rows: 8,003,293
Columns: 90
$ snapshot_dt <dttm> 2005-11-25, 2005-11-25, 2005-11-25, 2005-11…
$ county_id <dbl> 18, 7, 10, 16, 58, 60, 62, 73, 74, 87, 99, 3…
$ county_desc <chr> "CATAWBA", "BEAUFORT", "BRUNSWICK", "CARTERE…
$ voter_reg_num <chr> "0", "000000000000", "000000000000", "000000…
$ ncid <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ status_cd <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R",…
$ voter_status_desc <chr> "REMOVED", "REMOVED", "REMOVED", "REMOVED", …
$ reason_cd <chr> "RL", "R2", "R2", "RP", "R2", "RL", "RP", "R…
$ voter_status_reason_desc <chr> "MOVED FROM COUNTY", "DUPLICATE", "DUPLICATE…
$ absent_ind <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ name_prefx_cd <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ last_name <chr> "AARON", "THOMPSON", "WILSON", "LANGSTON", "…
$ first_name <chr> "CHARLES", "JESSICA", "WILLIAM", "VON", "LIZ…
$ midl_name <chr> "F", "RUTH", "B", NA, "IRENE", "R", "HUGHES"…
$ name_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ house_num <dbl> 0, 961, 0, 264, 1536, 1431, 171, 0, 0, 1000,…
$ half_code <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ street_dir <chr> NA, NA, NA, NA, NA, "E", NA, NA, NA, NA, NA,…
$ street_name <chr> "ROUTE 4", "TAYLOR", "MIRROR LAKE", "CARL GA…
$ street_type_cd <chr> NA, "RD", NA, "RD", "RD", "ST", NA, NA, NA, …
$ street_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ unit_designator <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ unit_num <chr> "147 BA", NA, NA, NA, NA, "1", NA, NA, NA, N…
$ res_city_desc <chr> "CONOVER", "CHOCOWINITY", "BOILING SPRING LA…
$ state_cd <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", NA…
$ zip_code <dbl> 28613, 27817, 28461, 28570, 27892, 28204, 27…
$ mail_addr1 <chr> NA, "619A FOUNDERS HALL, CP0 # 9100", NA, NA…
$ mail_addr2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ mail_addr3 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ mail_addr4 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ mail_city <chr> NA, "ASHEVILLE", NA, NA, NA, NA, "CANDOR", N…
$ mail_state <chr> NA, "NC", NA, NA, NA, NA, "NC", NA, NA, NA, …
$ mail_zipcode <dbl> NA, 0, NA, NA, NA, NA, 27229, NA, NA, NA, NA…
$ area_cd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ phone_num <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ race_code <chr> "W", "W", "U", "B", "W", "W", "W", "U", "U",…
$ race_desc <chr> "WHITE", "WHITE", "UNDESIGNATED", "BLACK or …
$ ethnic_code <chr> "NL", "NL", "NL", "NL", "NL", "NL", "NL", "N…
$ ethnic_desc <chr> "NOT HISPANIC or NOT LATINO", "NOT HISPANIC …
$ party_cd <chr> "REP", "REP", "UNA", "DEM", "REP", "UNA", "D…
$ party_desc <chr> "REPUBLICAN", "REPUBLICAN", "UNAFFILIATED", …
$ sex_code <chr> "M", "F", "U", "M", "F", "F", "M", "U", "U",…
$ sex <chr> "MALE", "FEMALE", "UNK", "MALE", "FEMALE", "…
$ age <dbl> 62, 26, 0, 58, 63, 30, 93, 0, 0, 82, 57, 72,…
$ birth_place <chr> NA, "NC", NA, "MI", NA, "VA", "NC", NA, NA, …
$ registr_dt <dttm> 1984-10-06, 2000-07-31, 1900-01-01, 1978-04…
$ precinct_abbrv <chr> NA, "CHOCO", NA, NA, NA, NA, NA, NA, NA, "BC…
$ precinct_desc <chr> NA, "CHOCOWINITY", NA, NA, NA, NA, NA, NA, N…
$ municipality_abbrv <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "JNV…
$ municipality_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "JON…
$ ward_abbrv <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ward_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ cong_dist_abbrv <chr> NA, "01", NA, NA, NA, NA, NA, NA, NA, "11", …
$ cong_dist_desc <chr> NA, "1ST CONGRESS", NA, NA, NA, NA, NA, NA, …
$ super_court_abbrv <chr> NA, "02", NA, NA, NA, NA, NA, NA, NA, "30A",…
$ super_court_desc <chr> NA, "2ND SUPERIOR COURT", NA, NA, NA, NA, NA…
$ judic_dist_abbrv <chr> NA, "02", NA, NA, NA, NA, NA, NA, NA, "30", …
$ judic_dist_desc <chr> NA, "2ND JUDICIAL", NA, NA, NA, NA, NA, NA, …
$ NC_senate_abbrv <chr> NA, "01", NA, NA, NA, NA, NA, NA, NA, "50", …
$ NC_senate_desc <chr> NA, "1ST SENATE", NA, NA, NA, NA, NA, NA, NA…
$ NC_house_abbrv <chr> NA, "006", NA, NA, NA, NA, NA, NA, NA, "119"…
$ NC_house_desc <chr> NA, "6TH HOUSE", NA, NA, NA, NA, NA, NA, NA,…
$ county_commiss_abbrv <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ county_commiss_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ township_abbrv <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ township_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ school_dist_abbrv <chr> NA, "SD2", NA, NA, NA, NA, NA, NA, NA, NA, N…
$ school_dist_desc <chr> NA, "SCHOOL #2", NA, NA, NA, NA, NA, NA, NA,…
$ fire_dist_abbrv <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ fire_dist_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ water_dist_abbrv <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ water_dist_desc <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sewer_dist_abbrv <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sewer_dist_desc <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sanit_dist_abbrv <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ sanit_dist_desc <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ rescue_dist_abbrv <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ rescue_dist_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ munic_dist_abbrv <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ munic_dist_desc <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dist_1_abbrv <chr> NA, "02", NA, NA, NA, NA, NA, NA, NA, "30", …
$ dist_1_desc <chr> NA, "2ND PROSECUTORIAL", NA, NA, NA, NA, NA,…
$ dist_2_abbrv <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ dist_2_desc <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ confidential_ind <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N",…
$ cancellation_dt <dttm> NA, 2001-07-06, 2001-02-05, NA, 2001-03-15,…
$ vtd_abbrv <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ vtd_desc <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ load_dt <dttm> 2014-07-15 22:21:54, 2014-07-15 22:21:54, 2…
$ age_group <chr> "41 TO 65", "26 TO 40", "UNKNOWN", "41 TO 65…
skimr::skim(d)
Warning in grepl("^\\s+$", x): input string 3907396 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 3975334 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 388213 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 503879 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 817815 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 7446786 is invalid in this locale
Warning in grepl("^\\s+$", x): input string 7446791 is invalid in this locale
Name | d |
Number of rows | 8003293 |
Number of columns | 90 |
_______________________ | |
Column type frequency: | |
character | 59 |
logical | 20 |
numeric | 7 |
POSIXct | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
county_desc | 0 | 1.00 | 3 | 12 | 0 | 100 | 0 |
voter_reg_num | 0 | 1.00 | 1 | 12 | 0 | 2708878 | 0 |
status_cd | 2 | 1.00 | 1 | 1 | 0 | 5 | 0 |
voter_status_desc | 2 | 1.00 | 6 | 22 | 0 | 5 | 0 |
reason_cd | 238 | 1.00 | 2 | 2 | 0 | 26 | 0 |
voter_status_reason_desc | 238 | 1.00 | 8 | 56 | 0 | 26 | 0 |
last_name | 122 | 1.00 | 1 | 23 | 0 | 269312 | 0 |
first_name | 254 | 1.00 | 1 | 19 | 0 | 176806 | 0 |
midl_name | 553015 | 0.93 | 1 | 20 | 0 | 249768 | 0 |
name_sufx_cd | 7561920 | 0.06 | 1 | 3 | 0 | 222 | 0 |
street_dir | 7409655 | 0.07 | 1 | 2 | 0 | 15 | 0 |
street_name | 7768 | 1.00 | 1 | 30 | 0 | 122064 | 0 |
street_type_cd | 527462 | 0.93 | 1 | 4 | 0 | 215 | 0 |
street_sufx_cd | 7698925 | 0.04 | 1 | 3 | 0 | 15 | 0 |
unit_num | 7020919 | 0.12 | 1 | 7 | 0 | 32785 | 0 |
res_city_desc | 3750 | 1.00 | 3 | 20 | 0 | 856 | 0 |
state_cd | 7277 | 1.00 | 1 | 2 | 0 | 20 | 0 |
mail_addr1 | 6814780 | 0.15 | 1 | 40 | 0 | 421307 | 0 |
mail_city | 6819798 | 0.15 | 1 | 30 | 0 | 4168 | 0 |
mail_state | 6819868 | 0.15 | 1 | 2 | 0 | 104 | 0 |
phone_num | 5370357 | 0.33 | 1 | 7 | 0 | 1539509 | 0 |
race_code | 0 | 1.00 | 1 | 1 | 0 | 7 | 0 |
race_desc | 0 | 1.00 | 5 | 34 | 0 | 7 | 0 |
ethnic_code | 0 | 1.00 | 2 | 2 | 0 | 3 | 0 |
ethnic_desc | 0 | 1.00 | 12 | 26 | 0 | 3 | 0 |
party_cd | 0 | 1.00 | 3 | 3 | 0 | 4 | 0 |
party_desc | 0 | 1.00 | 10 | 13 | 0 | 5 | 0 |
sex_code | 0 | 1.00 | 1 | 1 | 0 | 3 | 0 |
sex | 0 | 1.00 | 3 | 6 | 0 | 3 | 0 |
birth_place | 1716730 | 0.79 | 2 | 2 | 0 | 56 | 0 |
precinct_abbrv | 1865111 | 0.77 | 1 | 6 | 0 | 1867 | 0 |
precinct_desc | 1865111 | 0.77 | 2 | 30 | 0 | 2686 | 0 |
municipality_abbrv | 4396616 | 0.45 | 1 | 4 | 0 | 429 | 0 |
municipality_desc | 4396616 | 0.45 | 4 | 26 | 0 | 571 | 0 |
ward_abbrv | 6116249 | 0.24 | 1 | 4 | 0 | 197 | 0 |
ward_desc | 6116249 | 0.24 | 1 | 28 | 0 | 256 | 0 |
cong_dist_abbrv | 1865114 | 0.77 | 2 | 2 | 0 | 13 | 0 |
cong_dist_desc | 1865114 | 0.77 | 2 | 27 | 0 | 46 | 0 |
super_court_abbrv | 1872590 | 0.77 | 2 | 4 | 0 | 68 | 0 |
super_court_desc | 1872590 | 0.77 | 2 | 30 | 0 | 78 | 0 |
judic_dist_abbrv | 1872576 | 0.77 | 2 | 3 | 0 | 40 | 0 |
judic_dist_desc | 1872576 | 0.77 | 2 | 23 | 0 | 54 | 0 |
NC_senate_abbrv | 1836472 | 0.77 | 2 | 2 | 0 | 50 | 0 |
NC_senate_desc | 1836472 | 0.77 | 6 | 24 | 0 | 63 | 0 |
NC_house_abbrv | 1829345 | 0.77 | 3 | 3 | 0 | 120 | 0 |
NC_house_desc | 1829345 | 0.77 | 6 | 25 | 0 | 125 | 0 |
county_commiss_abbrv | 4365150 | 0.45 | 1 | 4 | 0 | 126 | 0 |
county_commiss_desc | 4365150 | 0.45 | 2 | 30 | 0 | 131 | 0 |
township_abbrv | 6760420 | 0.16 | 1 | 4 | 0 | 119 | 0 |
township_desc | 6760420 | 0.16 | 1 | 27 | 0 | 223 | 0 |
school_dist_abbrv | 3380612 | 0.58 | 1 | 7 | 0 | 140 | 0 |
school_dist_desc | 3380612 | 0.58 | 2 | 30 | 0 | 145 | 0 |
fire_dist_abbrv | 7650404 | 0.04 | 1 | 4 | 0 | 82 | 0 |
fire_dist_desc | 7650404 | 0.04 | 5 | 27 | 0 | 107 | 0 |
rescue_dist_desc | 7885291 | 0.01 | 10 | 16 | 0 | 13 | 0 |
dist_1_abbrv | 1865111 | 0.77 | 2 | 3 | 0 | 39 | 0 |
dist_1_desc | 1865111 | 0.77 | 2 | 27 | 0 | 51 | 0 |
confidential_ind | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
age_group | 0 | 1.00 | 7 | 12 | 0 | 6 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
ncid | 8003293 | 0 | NaN | : |
absent_ind | 8003293 | 0 | NaN | : |
name_prefx_cd | 8003293 | 0 | NaN | : |
half_code | 8002085 | 0 | 0.38 | FAL: 752, TRU: 456 |
unit_designator | 8003293 | 0 | NaN | : |
mail_addr2 | 8003292 | 0 | 1.00 | TRU: 1 |
mail_addr3 | 8003293 | 0 | NaN | : |
mail_addr4 | 8003293 | 0 | NaN | : |
water_dist_abbrv | 7998651 | 0 | 1.00 | TRU: 4642 |
water_dist_desc | 8000971 | 0 | 1.00 | TRU: 2322 |
sewer_dist_abbrv | 8002465 | 0 | 1.00 | TRU: 828 |
sewer_dist_desc | 8003293 | 0 | NaN | : |
sanit_dist_abbrv | 7997607 | 0 | 0.11 | FAL: 5069, TRU: 617 |
sanit_dist_desc | 8003293 | 0 | NaN | : |
munic_dist_abbrv | 8002280 | 0 | 1.00 | TRU: 1013 |
munic_dist_desc | 8002280 | 0 | 1.00 | TRU: 1013 |
dist_2_abbrv | 8003293 | 0 | NaN | : |
dist_2_desc | 8003293 | 0 | NaN | : |
vtd_abbrv | 8003293 | 0 | NaN | : |
vtd_desc | 8003293 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
county_id | 0 | 1.00 | 51.96 | 27.31 | 1 | 32 | 51 | 74 | 100 | ▅▇▇▆▆ |
house_num | 0 | 1.00 | 2664.17 | 706533.11 | 0 | 210 | 900 | 3032 | 1400000000 | ▇▁▁▁▁ |
zip_code | 17957 | 1.00 | 30806.46 | 890299.61 | 0 | 27523 | 28027 | 28401 | 289309205 | ▇▁▁▁▁ |
mail_zipcode | 6819826 | 0.15 | 24463505.17 | 78280243.02 | -27379 | 27812 | 28345 | 28699 | 987725001 | ▇▁▁▁▁ |
area_cd | 5621640 | 0.30 | 696.09 | 259.80 | -83 | 336 | 828 | 910 | 999 | ▁▃▁▂▇ |
age | 0 | 1.00 | 48.71 | 21.28 | 0 | 34 | 46 | 60 | 7644 | ▇▁▁▁▁ |
rescue_dist_abbrv | 7885291 | 0.01 | 47.54 | 10.66 | 12 | 41 | 54 | 55 | 88 | ▁▃▇▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
snapshot_dt | 0 | 1.00 | 2005-11-25 00:00:00 | 2005-11-25 00:00:00 | 2005-11-25 00:00:00 | 1 |
registr_dt | 0 | 1.00 | 1805-08-01 00:00:00 | 9999-10-21 00:00:00 | 1995-02-22 00:00:00 | 75089 |
cancellation_dt | 6240946 | 0.22 | 1988-12-06 00:00:00 | 2005-11-23 00:00:00 | 2003-01-13 00:00:00 | 3975 |
load_dt | 0 | 1.00 | 2014-07-15 22:21:54 | 2014-07-15 22:21:54 | 2014-07-15 22:21:54 | 1 |
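The grepl() warnings emitted while skim() ran report strings that are "invalid in this locale", which typically means bytes that are not valid in the session encoding. A minimal sketch of how such values might be located, assuming the session uses UTF-8 and (purely as an assumption) that the stray bytes are Latin-1; the toy values are invented.

```r
# Toy values; the second contains a Latin-1 byte (0xF1) that is not valid UTF-8.
x <- c("SMITH", "MU\xf1OZ")

# Positions of the strings that are not valid UTF-8.
bad <- which(!validUTF8(x))
print(bad)

# Assumption: the offending bytes are Latin-1. Convert only the invalid entries.
x_fixed <- x
x_fixed[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")
```

Applied column by column to the character fields, the same test would give the row numbers to inspect before deciding on a repair.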
The warnings from skim() indicate that a handful of rows contain unexpected characters. If they are in rows we use they will have to be located and dealt with.
county_id: County identification number
county_desc: County description
summary(d$county_id)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 32.00 51.00 51.96 74.00 100.00
table(d$county_id)
1 2 3 4 5 6 7 8 9 10 11
142978 32527 10606 21692 23969 19839 41680 17955 30594 83782 196729
12 13 14 15 16 17 18 19 20 21 22
76685 129792 68065 8558 64968 19521 158203 45262 27093 11937 10507
23 24 25 26 27 28 29 30 31 32 33
74946 49306 98242 250411 20356 33701 124906 27493 34816 324683 51893
34 35 36 37 38 39 40 41 42 43 44
350882 41914 140041 8765 8810 40697 15672 473739 43500 74559 63376
45 46 47 48 49 50 51 52 53 54 55
101988 20477 29683 5004 103777 39634 111748 10102 40144 50822 62544
56 57 58 59 60 61 62 63 64 65 66
31436 19721 24684 38463 697897 15768 24296 67268 81129 185852 17200
67 68 69 70 71 72 73 74 75 76 77
106315 227603 15232 40871 40743 10037 30596 179177 23721 107895 36649
78 79 80 81 82 83 84 85 86 87 88
90736 87196 115178 49978 45743 28494 52563 37016 53848 14744 38191
89 90 91 92 93 94 95 96 97 98 99
3445 122676 39355 678226 19534 15399 59440 81209 55298 70609 31275
100
19014
table(d$county_desc)
ALAMANCE ALEXANDER ALLEGHANY ANSON ASHE AVERY
142978 32527 10606 21692 23969 19839
BEAUFORT BERTIE BLADEN BRUNSWICK BUNCOMBE BURKE
41680 17955 30594 83782 196729 76685
CABARRUS CALDWELL CAMDEN CARTERET CASWELL CATAWBA
129792 68065 8558 64968 19521 158203
CHATHAM CHEROKEE CHOWAN CLAY CLEVELAND COLUMBUS
45262 27093 11937 10507 74946 49306
CRAVEN CUMBERLAND CURRITUCK DARE DAVIDSON DAVIE
98242 250411 20356 33701 124906 27493
DUPLIN DURHAM EDGECOMBE FORSYTH FRANKLIN GASTON
34816 324683 51893 350882 41914 140041
GATES GRAHAM GRANVILLE GREENE GUILFORD HALIFAX
8765 8810 40697 15672 473739 43500
HARNETT HAYWOOD HENDERSON HERTFORD HOKE HYDE
74559 63376 101988 20477 29683 5004
IREDELL JACKSON JOHNSTON JONES LEE LENOIR
103777 39634 111748 10102 40144 50822
LINCOLN MACON MADISON MARTIN MCDOWELL MECKLENBURG
62544 31436 19721 24684 38463 697897
MITCHELL MONTGOMERY MOORE NASH NEW HANOVER NORTHAMPTON
15768 24296 67268 81129 185852 17200
ONSLOW ORANGE PAMLICO PASQUOTANK PENDER PERQUIMANS
106315 227603 15232 40871 40743 10037
PERSON PITT POLK RANDOLPH RICHMOND ROBESON
30596 179177 23721 107895 36649 90736
ROCKINGHAM ROWAN RUTHERFORD SAMPSON SCOTLAND STANLY
87196 115178 49978 45743 28494 52563
STOKES SURRY SWAIN TRANSYLVANIA TYRRELL UNION
37016 53848 14744 38191 3445 122676
VANCE WAKE WARREN WASHINGTON WATAUGA WAYNE
39355 678226 19534 15399 59440 81209
WILKES WILSON YADKIN YANCEY
55298 70609 31275 19014
They look reasonable, to the extent that I can tell without knowing anything about the counties.
voter_reg_num: Voter registration number (unique by county)
table(d$voter_reg_num) %>% head(12)
0 000000000000 000000000001 000000000002 000000000003 000000000004
1 10 56 64 65 66
000000000005 000000000006 000000000007 000000000008 000000000009 000000000010
61 65 70 64 75 71
table(d$voter_reg_num) %>% tail(12)
000999834828 000999834834 000999834837 000999834845 000999834860 000999834869
1 1 1 1 1 1
000999834879 000999834883 000999834884 000999834888 000999834892 000999834900
1 1 1 1 1 1
summary(as.integer(d$voter_reg_num))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 36265 155221 5459965 3039980 999834900
d$voter_reg_num %>% stringr::str_length() %>% table(useNA = "ifany")
.
1 12
1 8003292
Look at the record with the short value.
d %>%
dplyr::filter(stringr::str_length(voter_reg_num) < 12) %>%
dplyr::select(county_id, voter_reg_num, status_cd, voter_status_desc, reason_cd, voter_status_reason_desc) %>%
knitr::kable()
county_id | voter_reg_num | status_cd | voter_status_desc | reason_cd | voter_status_reason_desc |
---|---|---|---|---|---|
18 | 0 | R | REMOVED | RL | MOVED FROM COUNTY |
Check whether county_id x voter_reg_num is unique, as claimed.
d %>%
dplyr::select(county_id, voter_reg_num) %>%
dplyr::mutate(id = stringr::str_c(as.character(county_id), ".", voter_reg_num)) %>%
dplyr::count(id) %>%
with(table(n))
n
1
8003293
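The same kind of check can be made without pasting the two columns into a string key, by testing the pair of columns for duplicates directly. A minimal sketch in base R on invented toy data; the data frame toy and its values are hypothetical.

```r
# Toy data: rows 2 and 3 share the same county_id and voter_reg_num.
toy <- data.frame(
  county_id = c(18, 7, 7, 10),
  voter_reg_num = c("0", "000000000001", "000000000001", "000000000001"),
  stringsAsFactors = FALSE
)

key <- toy[c("county_id", "voter_reg_num")]

# Number of rows that repeat an earlier key (0 means the key is unique).
n_dup <- sum(duplicated(key))

# All rows involved in a repeated key, including the first occurrence.
dup_rows <- toy[duplicated(key) | duplicated(key, fromLast = TRUE), ]
print(dup_rows)
```

Working on the column pair avoids any risk of two different pairs producing the same pasted string.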
county_id x voter_reg_num is unique, even including observations flagged as duplicates.
ncid: North Carolina identification number (NCID) of voter
ncid is missing in every row. That’s a shame. It would have been useful.
status_cd: Status code for voter registration
voter_status_desc: Status code description
table(d$status_cd, useNA = "always")
A D I R S <NA>
4914521 41348 495603 2546485 5334 2
table(d$voter_status_desc, useNA = "always")
ACTIVE DENIED INACTIVE
4914521 41348 495603
REMOVED TEMPORARY REGISTRATION <NA>
2546485 5334 2
reason_cd: Reason code for voter registration status
voter_status_reason_desc: Reason code description
table(d$reason_cd, useNA = "always")
A1 A2 AA AL AN AP AV DI DU IL
13737 71296 50 523899 7517 198333 4100220 6991 34357 10585
IN IU R2 RA RC RD RF RL RM RP
181320 303197 78951 59008 662 443486 63501 888056 551073 367511
RQ RS RT SM SO SP <NA>
4194 89049 729 3975 1307 51 238
table(d$voter_status_reason_desc, useNA = "always")
ADMINISTRATIVE
59008
ARMED FORCES
50
CONFIRMATION NOT RETURNED
181320
CONFIRMATION PENDING
71296
CONFIRMATION RETURNED UNDELIVERABLE
303197
DECEASED
443486
DUPLICATE
78951
FELONY CONVICTION
63501
LEGACY - CONVERSION
10585
LEGACY DATA
523899
MILITARY
3975
MOVED FROM COUNTY
888056
MOVED FROM STATE
89049
OVERSEAS CITIZEN
1307
PREVIOUSLY REGISTERED
51
REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
551073
REMOVED DUE TO SUSTAINED CHALLENGE
662
REMOVED UNDER OLD PURGE LAW
367511
REQUEST FROM VOTER
4194
TEMPORARY REGISTRANT
729
UNAVAILABLE ESSENTIAL INFORMATION
6991
UNVERIFIED
13737
UNVERIFIED NEW
7517
VERIFICATION PENDING
198333
VERIFICATION RETURNED UNDELIVERABLE
34357
VERIFIED
4100220
<NA>
238
Look at the relationship between status and status reason.
table(
stringr::str_trunc(d$voter_status_reason_desc, 25),
stringr::str_trunc(d$voter_status_desc, 8),
useNA = "always"
)
ACTIVE DENIED INACTIVE REMOVED TEMPO... <NA>
ADMINISTRATIVE 0 0 0 59008 0 0
ARMED FORCES 50 0 0 0 0 0
CONFIRMATION NOT RETURNED 0 0 181320 0 0 0
CONFIRMATION PENDING 71295 0 0 1 0 0
CONFIRMATION RETURNED ... 0 0 303197 0 0 0
DECEASED 0 0 0 443486 0 0
DUPLICATE 0 0 0 78951 0 0
FELONY CONVICTION 0 0 0 63501 0 0
LEGACY - CONVERSION 1 0 10584 0 0 0
LEGACY DATA 523897 0 2 0 0 0
MILITARY 0 0 0 0 3975 0
MOVED FROM COUNTY 0 0 0 888055 0 1
MOVED FROM STATE 0 0 0 89049 0 0
OVERSEAS CITIZEN 0 0 0 0 1307 0
PREVIOUSLY REGISTERED 0 0 0 1 50 0
REMOVED AFTER 2 FED GE... 0 0 0 551072 0 1
REMOVED DUE TO SUSTAIN... 0 0 0 662 0 0
REMOVED UNDER OLD PURG... 0 0 0 367511 0 0
REQUEST FROM VOTER 0 0 0 4194 0 0
TEMPORARY REGISTRANT 0 0 0 729 0 0
UNAVAILABLE ESSENTIAL ... 0 6990 0 1 0 0
UNVERIFIED 13731 0 0 4 2 0
UNVERIFIED NEW 7516 0 0 1 0 0
VERIFICATION PENDING 198331 0 1 1 0 0
VERIFICATION RETURNED ... 0 34357 0 0 0 0
VERIFIED 4099700 1 499 20 0 0
<NA> 0 0 0 238 0 0
voter_status_desc == "ACTIVE" & voter_status_reason_desc == "VERIFIED"
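If the condition above is used to select rows (an assumption; the text does not say how it is applied), the filter might be sketched as follows in base R. The toy data frame and the name active_verified are hypothetical.

```r
# Toy data covering a match, two non-matches, and a missing reason.
toy <- data.frame(
  voter_status_desc = c("ACTIVE", "ACTIVE", "REMOVED", "ACTIVE"),
  voter_status_reason_desc = c("VERIFIED", "LEGACY DATA", "DECEASED", NA),
  stringsAsFactors = FALSE
)

# Logical test; it evaluates to NA where a needed value is missing.
keep <- with(
  toy,
  voter_status_desc == "ACTIVE" & voter_status_reason_desc == "VERIFIED"
)

# which() drops rows where the test is NA as well as rows where it is FALSE.
active_verified <- toy[which(keep), ]
print(active_verified)
```

Indexing with which(keep) rather than keep itself prevents rows of NA from appearing in the result when a status or reason is missing.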
Identify any oddities about the name fields that might benefit from standardisation.
I will do this on all the rows, not just the subset to be analysed. I expect the oddities to be much the same whether or not the rows are later excluded from the analyses, and the larger sample size will help in spotting rare problems.
I will look at the three name fields concurrently because I expect the oddities to be similar across the name fields.
last_name: Voter last name
first_name: Voter first name
midl_name: Voter middle name
Look for possible anomalies in names.
d %>% with(table(is.na(last_name)))
FALSE TRUE
8003171 122
d %>% with(table(is.na(first_name)))
FALSE TRUE
8003039 254
d %>% with(table(is.na(midl_name)))
FALSE TRUE
7450278 553015
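The three is.na() tables above can be collapsed into a single summary, and extended to count rows where every name field is missing. A minimal sketch in base R on invented toy data; toy, name_cols, n_missing, and n_all_missing are hypothetical names.

```r
# Toy data: the second row has no name information at all.
toy <- data.frame(
  last_name = c("AARON", NA, "WILSON"),
  first_name = c("CHARLES", NA, "WILLIAM"),
  midl_name = c("F", NA, NA),
  stringsAsFactors = FALSE
)

name_cols <- c("last_name", "first_name", "midl_name")

# Missing-value count for each name column.
n_missing <- colSums(is.na(toy[name_cols]))
print(n_missing)

# Number of rows in which all three name fields are missing.
n_all_missing <- sum(rowSums(is.na(toy[name_cols])) == length(name_cols))
print(n_all_missing)
```

Counting the rows with no name at all separates records that are effectively empty from those that are missing only one field.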
Look at the records missing last or first names to see if there is some explanation for their absence.
# last name missing
d %>%
dplyr::filter(is.na(last_name)) %>%
dplyr::select(
first_name, midl_name, name_sufx_cd,
sex, age,
house_num, street_name,
voter_status_desc, voter_status_reason_desc
) %>%
dplyr::arrange(voter_status_desc, voter_status_reason_desc, first_name) %>%
knitr::kable()
first_name | midl_name | name_sufx_cd | sex | age | house_num | street_name | voter_status_desc | voter_status_reason_desc |
---|---|---|---|---|---|---|---|---|
CHRISTINA | GAYLE | NA | FEMALE | 27 | 6024 | ROY LEE WOODS | REMOVED | ADMINISTRATIVE |
STEPHANIE | ELISE | NA | FEMALE | 25 | 4877 | COLLEGE ACRES | REMOVED | ADMINISTRATIVE |
WILLIAM | TODD | NA | MALE | 41 | 1396 | US HWY 221 | REMOVED | ADMINISTRATIVE |
A | J | NA | FEMALE | 94 | 146 | RURAL RTE 1 | REMOVED | DECEASED |
ALBERT | FREEMAN | NA | MALE | 82 | 0 | UNKNOWN | REMOVED | DECEASED |
BROUNDA | KAY | NA | FEMALE | 58 | 207 | 7TH | REMOVED | DECEASED |
CLARENCE | EDWARD | NA | MALE | 85 | 230 | WAYCROSS | REMOVED | DECEASED |
COLON | WALTER | NA | MALE | 71 | 642 | ANNS | REMOVED | DECEASED |
ELOISE | L | NA | FEMALE | 0 | 0 | RURAL RTE 2 | REMOVED | DECEASED |
GENE | EDWARD | NA | MALE | 74 | 105 | PERRY FOX | REMOVED | DECEASED |
HELEN | KOOPS | NA | FEMALE | 89 | 43 | RANGEVIEW ACRES | REMOVED | DECEASED |
JAMES | A | NA | MALE | 75 | 28 | PINE SHORE | REMOVED | DECEASED |
JAMES | EARL | NA | MALE | 69 | 3845 | ALLISON | REMOVED | DECEASED |
JOHN | ROBERT | NA | MALE | 87 | 0 | FRYEMONT | REMOVED | DECEASED |
MARTHA | BOATRIGHT | NA | FEMALE | 77 | 4006 | RICHLANDS | REMOVED | DECEASED |
MELISSA | O | NA | FEMALE | 39 | 4912 | DEVIL’S RACETRACK | REMOVED | DECEASED |
VERA | M | NA | FEMALE | 76 | 231 | PO BOX | REMOVED | DECEASED |
VOLA | B | NA | FEMALE | 98 | 0 | TALLULAH | REMOVED | DECEASED |
CHARLES | EMMETT | NA | MALE | 73 | 340 | VANDERBILT | REMOVED | DUPLICATE |
FANNIE | N | NA | FEMALE | 77 | 115 | CAROLINA | REMOVED | DUPLICATE |
PATRICIA | C | NA | FEMALE | 75 | 1425 | WARRIOR | REMOVED | DUPLICATE |
PAULINE | NA | NA | FEMALE | 56 | 8 | REDBUD | REMOVED | DUPLICATE |
ROBERT | ERIC | NA | MALE | 40 | 56 | KILGORE | REMOVED | DUPLICATE |
VIRGINIA | L | NA | FEMALE | 90 | 428 | RINK DAM | REMOVED | DUPLICATE |
WELDON | COX | NA | MALE | 76 | 109 | MCNEILL | REMOVED | DUPLICATE |
DEONTRAYVIA | EMANUEL | NA | MALE | 30 | 1508 | CHARLES | REMOVED | FELONY CONVICTION |
JANE | ANN | NA | FEMALE | 26 | 157 | MT PILOT MHP | REMOVED | FELONY CONVICTION |
KIM | LEE | NA | MALE | 51 | 114 | LYLE KNOB | REMOVED | FELONY CONVICTION |
LEANDER | WARREN | NA | MALE | 43 | 53 | MOUNTAIN VIEW | REMOVED | FELONY CONVICTION |
MIKE | J | NA | MALE | 51 | 0 | SNIDER | REMOVED | FELONY CONVICTION |
SHIRLEY | GRIFFIN | NA | FEMALE | 40 | 1138 | ROCKY RUN | REMOVED | FELONY CONVICTION |
WESLEY | WILSON | NA | MALE | 41 | 195 | HIGH POINT | REMOVED | FELONY CONVICTION |
WILLIAM | RAY | NA | MALE | 43 | 412 | OAK | REMOVED | FELONY CONVICTION |
AMY | DENISE | NA | FEMALE | 34 | 5439 | LILLY FLOWER | REMOVED | MOVED FROM COUNTY |
ANDREA | CROUCH | NA | FEMALE | 35 | 90 | FOREST OAKS | REMOVED | MOVED FROM COUNTY |
CAROLYN | MOORE | NA | FEMALE | 56 | 9830 | RIDGEVILLE | REMOVED | MOVED FROM COUNTY |
DAVID | DEAN | NA | MALE | 38 | 98 | CEDAR | REMOVED | MOVED FROM COUNTY |
FREDDA | M | NA | FEMALE | 82 | 16930 | KNOXWOOD | REMOVED | MOVED FROM COUNTY |
JAMES | DONALD | III | MALE | 45 | 49 | WASHINGTON | REMOVED | MOVED FROM COUNTY |
JESSIE | H | NA | FEMALE | 81 | 206 | JONES | REMOVED | MOVED FROM COUNTY |
JUDITH | A | NA | FEMALE | 44 | 2338 | PROVIDENCE CREEK | REMOVED | MOVED FROM COUNTY |
KATHLEEN | LOUISE | NA | FEMALE | 23 | 302 | UNIVERSITY | REMOVED | MOVED FROM COUNTY |
KELLY | R | NA | FEMALE | 38 | 2315 | TORRINGTON | REMOVED | MOVED FROM COUNTY |
LARRY | ANTHONY | SR | MALE | 46 | 0 | TRENT | REMOVED | MOVED FROM COUNTY |
LARRY | DALLAS | NA | MALE | 63 | 1045 | HUNTER CREEK | REMOVED | MOVED FROM COUNTY |
MARY | MOSELEY | NA | FEMALE | 46 | 407 | BUTLER | REMOVED | MOVED FROM COUNTY |
MATTHEW | JAMES | NA | MALE | 25 | 243 | 7TH | REMOVED | MOVED FROM COUNTY |
MIRANDA | MARIE | NA | FEMALE | 23 | 908 | LOGAN | REMOVED | MOVED FROM COUNTY |
NATALIE | BASSHAM | NA | FEMALE | 32 | 4801 | HOWE | REMOVED | MOVED FROM COUNTY |
PATSY | D | NA | FEMALE | 50 | 825 | CENTER | REMOVED | MOVED FROM COUNTY |
SHIELA | WEST | NA | FEMALE | 57 | 176 | SHARON VALLEY | REMOVED | MOVED FROM COUNTY |
STELLA | NORWOOD | NA | FEMALE | 41 | 137 | CAMBRIDGE | REMOVED | MOVED FROM COUNTY |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | MOVED FROM COUNTY |
HENRY | RAY | NA | MALE | 69 | 794 | PAUL PAYNE STORE | REMOVED | MOVED FROM STATE |
JASON | M | NA | MALE | 35 | 11421 | FOUNTAINGROVE | REMOVED | MOVED FROM STATE |
L | KENT | NA | MALE | 65 | 5 | PALM | REMOVED | MOVED FROM STATE |
LINDA | LOU | NA | FEMALE | 58 | 134 | SAM RICHARDSON | REMOVED | MOVED FROM STATE |
ROBERT | CARL | NA | MALE | 56 | 867 | GEORGE’S GAP | REMOVED | MOVED FROM STATE |
ROY | W | NA | MALE | 0 | 1329 | DEVONSHIRE | REMOVED | MOVED FROM STATE |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
DUOC | VAN | DO | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
JEREMY | SEAN | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
L | F | III | MALE | 58 | 520 | CRAVEN | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | 08 | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | UNKNOWN | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
last_name are REMOVED. Perhaps it’s a side-effect of the removal process.
# first name missing
d %>%
dplyr::filter(is.na(first_name)) %>%
dplyr::select(
last_name, midl_name, name_sufx_cd,
sex, age,
house_num, street_name,
voter_status_desc, voter_status_reason_desc
) %>%
dplyr::arrange(voter_status_desc, voter_status_reason_desc, midl_name) %>%
knitr::kable()
last_name | midl_name | name_sufx_cd | sex | age | house_num | street_name | voter_status_desc | voter_status_reason_desc |
---|---|---|---|---|---|---|---|---|
TRIANOSKY | SUSAN SMITH | NA | FEMALE | 46 | 2493 | US HWY 221 | ACTIVE | CONFIRMATION PENDING |
JOSEY | BETTY | NA | FEMALE | 61 | 1688 | WALTERS MILL | ACTIVE | LEGACY DATA |
PARRISH | BRENDA | NA | FEMALE | 59 | 461 | SHADY GROVE | ACTIVE | LEGACY DATA |
ROBINSON | JACQUELINE P | NA | FEMALE | 39 | 402 | MAIN | ACTIVE | LEGACY DATA |
UNDERWOOD | REGINA | NA | FEMALE | 46 | 130 | CASWELL | ACTIVE | LEGACY DATA |
JONES LARRY MALLOR | NA | JR | MALE | 98 | 0 | RURAL RTE 5 | ACTIVE | LEGACY DATA |
HOLMAN | HOWARD | NA | MALE | 41 | 402 | AVALON | ACTIVE | UNVERIFIED NEW |
YABIN | NA | NA | MALE | 53 | 2715 | WESTERWOOD VILLAGE | ACTIVE | VERIFICATION PENDING |
MORRIS | ALEXANDER | NA | MALE | 30 | 8118 | WOODWAY OAK | ACTIVE | VERIFIED |
BULLARD | ALEXIS | NA | UNK | 19 | 108 | THIRD | ACTIVE | VERIFIED |
ZIMMER | CLIFFORD | NA | MALE | 64 | 33 | GREEN SPRINGS | ACTIVE | VERIFIED |
CHESTER | JAMES | NA | UNK | 39 | 116 | ASH LANDING | ACTIVE | VERIFIED |
ALEXANDER | JASON | NA | MALE | 28 | 165 | ARLEE | ACTIVE | VERIFIED |
PATTERSON | JOHN DEXTER | III | MALE | 55 | 3707 | PINECREST | ACTIVE | VERIFIED |
MCKEEL | LESTER | NA | MALE | 77 | 3191 | BIG DADDYS | ACTIVE | VERIFIED |
FRISBY | M | JR | MALE | 33 | 145 | CABARRUS GRAVES DORM | ACTIVE | VERIFIED |
FUQUA | MARY | NA | FEMALE | 59 | 6757 | NC HIGHWAY 62 S | ACTIVE | VERIFIED |
MOLET | MICHAEL | NA | MALE | 26 | 629 | PINE FOREST | ACTIVE | VERIFIED |
KAUCHICK | PAULINE | NA | FEMALE | 26 | 5404 | BAKERS MILL | ACTIVE | VERIFIED |
FUQUA | WILLIAM | NA | MALE | 63 | 6757 | NC HIGHWAY 62 S | ACTIVE | VERIFIED |
WARREN | NA | JD | MALE | 68 | 43 | J AND S | ACTIVE | VERIFIED |
FRYE WILLIAM C | NA | II | MALE | 50 | 163 | GATES FOREST | ACTIVE | VERIFIED |
BURGESS | NA | NA | FEMALE | 29 | 0 | HORACE PERRY | ACTIVE | VERIFIED |
PHOENIX | NA | NA | FEMALE | 45 | 496 | MIRACLE MOUNTAIN | ACTIVE | VERIFIED |
JUDITH | NA | NA | FEMALE | 50 | 141 | HILLSIDE | ACTIVE | VERIFIED |
MALIK | NA | NA | MALE | 33 | 2517 | ASHBY WOODS | ACTIVE | VERIFIED |
ELSASS | NA | NA | MALE | 37 | 18924 | RIVER FALLS | ACTIVE | VERIFIED |
MAGENTA | NA | NA | FEMALE | 42 | 7324 | WINDYRUSH | ACTIVE | VERIFIED |
GRAYWOLF | NA | NA | MALE | 57 | 2402 | OTIS | ACTIVE | VERIFIED |
AMEN | NA | NA | MALE | 41 | 15 | BERRYMEADOW | ACTIVE | VERIFIED |
SILVERMOON | NA | NA | FEMALE | 40 | 5529 | SUNLIGHT | ACTIVE | VERIFIED |
PELKEY | CHARES | JR | MALE | 59 | 1395 | EVA | DENIED | UNAVAILABLE ESSENTIAL INFORMATION |
PITTS | DARRYL | NA | MALE | 19 | 1801 | FAYETTEVILLE | DENIED | VERIFICATION RETURNED UNDELIVERABLE |
LE SON | NA | NA | UNK | 35 | 5453 | SHALLOWFORD | DENIED | VERIFICATION RETURNED UNDELIVERABLE |
WHITFIELD KAY M | NA | NA | FEMALE | 79 | 819 | POOR | INACTIVE | CONFIRMATION NOT RETURNED |
MEDLIN | ROBERT E | NA | FEMALE | 0 | 0 | UNKNOWN | INACTIVE | CONFIRMATION RETURNED UNDELIVERABLE |
BRICE | NA | NA | MALE | 33 | 30 | OREGON | INACTIVE | CONFIRMATION RETURNED UNDELIVERABLE |
CAPARCO | NA | JEN | FEMALE | 33 | 326 | ELM | INACTIVE | CONFIRMATION RETURNED UNDELIVERABLE |
MORRISON | NA | NA | MALE | 34 | 9456 | LEXINGTON | INACTIVE | CONFIRMATION RETURNED UNDELIVERABLE |
BALLARD | LEIGH | NA | FEMALE | 29 | 200 | BARNES | REMOVED | ADMINISTRATIVE |
COTTEN | NA | NA | FEMALE | 83 | 0 | SNOW HILL | REMOVED | ADMINISTRATIVE |
SMITH | NA | NA | UNK | 0 | 0 | LAWSON | REMOVED | ADMINISTRATIVE |
0000000072294 | NA | NA | MALE | 46 | 1013 | FREDERICK | REMOVED | ADMINISTRATIVE |
ALSBROOKS | ELEANOR | NA | FEMALE | 90 | 98 | COUNTRY CLUB | REMOVED | DECEASED |
OXENDINE | MITCHEL | NA | MALE | 51 | 410 | RURAL RTE 1 | REMOVED | DECEASED |
WILLIS | MOLLIE | NA | FEMALE | 92 | 227 | RURAL RTE 1 | REMOVED | DECEASED |
ELLER | RETA KATHLEE | NA | FEMALE | 56 | 321 | MARTIN | REMOVED | DECEASED |
CHITTY | RUBEN | D | FEMALE | 98 | 0 | ROUTE 1 | REMOVED | DECEASED |
SELENE | NA | NA | FEMALE | 57 | 103 | ORCHARD | REMOVED | DECEASED |
LOWRY | NA | NA | MALE | 48 | 1515 | ST ANNA | REMOVED | DECEASED |
DE BRAGANZA | NA | NA | MALE | 93 | 2200 | BROOKFIELD | REMOVED | DECEASED |
MIDDLETON | C | NA | FEMALE | 84 | 223 | MIDDLETON | REMOVED | DUPLICATE |
BELL | JAI-MIL | NA | FEMALE | 23 | 0 | WSSU | REMOVED | DUPLICATE |
HWY | LIBERA | V | MALE | 69 | 0 | UNKNOWN | REMOVED | DUPLICATE |
OWENS | MICHELLE | NA | FEMALE | 24 | 524 | BAKER | REMOVED | DUPLICATE |
WILTON | SUSAN LORRAINE | NA | FEMALE | 50 | 105 | EDGEHILL | REMOVED | DUPLICATE |
ALLRED LINDA H | NA | NA | FEMALE | 66 | 3117 | WENTWORTH | REMOVED | DUPLICATE |
AMATO,KATHERINE,M | NA | NA | FEMALE | 50 | 1500 | PLYMOUTH | REMOVED | DUPLICATE |
AMIDON,PETER,LEVENT | NA | NA | MALE | 33 | 809 | CROSSBOW | REMOVED | DUPLICATE |
BEST,SYDNEY,ALLISON | NA | NA | FEMALE | 37 | 0 | NA | REMOVED | DUPLICATE |
BETHEA HAROLD LEE | NA | NA | FEMALE | 46 | 2011 | MILL POND | REMOVED | DUPLICATE |
BEVERLY CONSTANCE M | NA | NA | FEMALE | 37 | 316 | NA | REMOVED | DUPLICATE |
BOOZER ANNA KRISTEN | NA | NA | FEMALE | 36 | 0 | NA | REMOVED | DUPLICATE |
BOYD,ALLEN AUBREY,II | NA | NA | MALE | 35 | 0 | NA | REMOVED | DUPLICATE |
BRICE.MICHAEL ARTHUR | NA | NA | MALE | 37 | 3102 | LAKEHURST | REMOVED | DUPLICATE |
CARR,WENDELL,H JR | NA | NA | MALE | 36 | 524 | KINGSTON | REMOVED | DUPLICATE |
CATHEY,LONNIE,JR | NA | NA | MALE | 57 | 1507 | NA | REMOVED | DUPLICATE |
CLARK JOANNE BENNETT | NA | NA | FEMALE | 63 | 910 | CHURCH | REMOVED | DUPLICATE |
CUSTER,GEORGE D,JR | NA | NA | MALE | 51 | 7313 | GOODWILL CHURCH | REMOVED | DUPLICATE |
DAVID HYDE JR | NA | NA | MALE | 44 | 204 | TUCSON | REMOVED | DUPLICATE |
DAVISKMICHAEL EDWARD | NA | NA | MALE | 51 | 4411 | CORNELL | REMOVED | DUPLICATE |
DUBUISSON ALLISON B | NA | NA | FEMALE | 53 | 7713 | THURSTON | REMOVED | DUPLICATE |
FORRIS FAY ANN | NA | NA | FEMALE | 39 | 3005 | CRESTBROOK | REMOVED | DUPLICATE |
FULK,IVEY LEE,JR | NA | NA | MALE | 45 | 0 | NA | REMOVED | DUPLICATE |
GRIFFIN JANICE FAYE | NA | NA | FEMALE | 42 | 1603 | EMMA | REMOVED | DUPLICATE |
HALL,PONTHEOLA,M | NA | NA | FEMALE | 53 | 2110 | YEARDLEYS | REMOVED | DUPLICATE |
HANNER JO ANNE LONG | NA | NA | FEMALE | 61 | 4343 | WELLS | REMOVED | DUPLICATE |
HODNETT,DORGIE,JR | NA | NA | MALE | 52 | 223 | FAULKNER | REMOVED | DUPLICATE |
HOGSHEAD,THOMAS H,JR | NA | NA | MALE | 66 | 1108 | EAST GREENWAY | REMOVED | DUPLICATE |
JENKINS,JAMES W,JR | NA | NA | MALE | 36 | 5600 | MELVIN | REMOVED | DUPLICATE |
JONES,JOHNSIE,H | NA | NA | FEMALE | 92 | 3804 | CHAMPION | REMOVED | DUPLICATE |
KENNY MAHLON DAY | NA | NA | MALE | 84 | 18 | KACIA | REMOVED | DUPLICATE |
KEY,GENE SAMUEL,JR | NA | NA | MALE | 44 | 600 | TABERNACLE CHURCH | REMOVED | DUPLICATE |
LACKEY CAROL M | NA | NA | FEMALE | 70 | 5225 | NA | REMOVED | DUPLICATE |
LAMBERT DAVID M | NA | NA | MALE | 43 | 7726 | OAKCLIFFE | REMOVED | DUPLICATE |
LESANE JACQUELINE | NA | NA | FEMALE | 35 | 0 | NA | REMOVED | DUPLICATE |
MAPP,DWIGHT,BENJAMIN | NA | NA | MALE | 57 | 2422 | PLEASANT HILL | REMOVED | DUPLICATE |
MAY ROBERT BRYAN | NA | NA | FEMALE | 87 | 1614 | TILLERY | REMOVED | DUPLICATE |
MCCARTHY LISA ANNE | NA | NA | FEMALE | 44 | 5406 | AGATHA | REMOVED | DUPLICATE |
MICHELMJOSEPH JOHN | NA | NA | MALE | 40 | 1113 | HENDERSON | REMOVED | DUPLICATE |
NORTON MYRA WOODELL | NA | NA | FEMALE | 63 | 427 | CASSELL | REMOVED | DUPLICATE |
PEDIGO BUFORD T | NA | NA | MALE | 96 | 4315 | NA | REMOVED | DUPLICATE |
REDWINE MARK ALAN | NA | NA | MALE | 53 | 6 | KEANSBURG | REMOVED | DUPLICATE |
ROUSE,ESTHER, MAE | NA | NA | FEMALE | 52 | 611 | NA | REMOVED | DUPLICATE |
RUPOLO SANDRA | NA | NA | FEMALE | 36 | 517 | TUCSON | REMOVED | DUPLICATE |
SIMS,RAYMOND LEE,SR | NA | NA | MALE | 66 | 2209 | NA | REMOVED | DUPLICATE |
URQUHART PARK VASCO | NA | NA | MALE | 53 | 1108 | HENDERSON | REMOVED | DUPLICATE |
VALDEZ DONNA A | NA | NA | FEMALE | 43 | 4221 | SCOUT | REMOVED | DUPLICATE |
WALKER,CHARLES,JR | NA | NA | MALE | 56 | 339 | NA | REMOVED | DUPLICATE |
WESTMORELAND J C | NA | NA | MALE | 83 | 4621 | CAMP BURTON | REMOVED | DUPLICATE |
WHITAKER,JAMES L,JR | NA | NA | MALE | 35 | 2831 | NA | REMOVED | DUPLICATE |
WHITE,LEE E,JR | NA | NA | FEMALE | 35 | 0 | NA | REMOVED | DUPLICATE |
VAN DORSTEN | NA | NA | FEMALE | 105 | 3021 | COUNTRY CLUB | REMOVED | DUPLICATE |
BENSON | EUGENE | NA | MALE | 60 | 1525 | MAIN | REMOVED | FELONY CONVICTION |
STURDIVANT | NA | NA | MALE | 0 | 0 | NO NAME | REMOVED | FELONY CONVICTION |
STURDIVANT | NA | NA | MALE | 0 | 0 | NO NAME | REMOVED | FELONY CONVICTION |
BENTON | BINARD | NA | FEMALE | 46 | 8180 | SCOTCH MEADOWS | REMOVED | MOVED FROM COUNTY |
JACOBS | HUTTO | NA | FEMALE | 29 | 1415 | KELLY | REMOVED | MOVED FROM COUNTY |
HOLSHOUSER | LOUISE | NA | FEMALE | 23 | 291 | FORD CREEK | REMOVED | MOVED FROM COUNTY |
GREEN | LYNN | NA | FEMALE | 42 | 2033 | HAMLET CHAPEL | REMOVED | MOVED FROM COUNTY |
JOHNSON | MICHELLE | NA | FEMALE | 28 | 2506 | NC 10 | REMOVED | MOVED FROM COUNTY |
BLICK | MOORE | NA | FEMALE | 53 | 2947 | 8TH ST | REMOVED | MOVED FROM COUNTY |
MORRISON | SAIN | NA | FEMALE | 52 | 4006 | 10TH AV | REMOVED | MOVED FROM COUNTY |
BURGOYNE | STEPHANIE A | NA | FEMALE | 55 | 689 | FILLGATE | REMOVED | MOVED FROM COUNTY |
BARNES | VALRIE | NA | FEMALE | 56 | 410 | 12TH | REMOVED | MOVED FROM COUNTY |
FEARS | VANDERBILT | JR | MALE | 45 | 5436 | RIVER FALLS | REMOVED | MOVED FROM COUNTY |
PINION | WAYNE | NA | MALE | 63 | 4015 | NC 268 | REMOVED | MOVED FROM COUNTY |
SKELTON | WILLIAM | III | MALE | 40 | 166 | FRANKLIN | REMOVED | MOVED FROM COUNTY |
RAINEY | NA | NA | MALE | 0 | 19 | SPRUCE | REMOVED | MOVED FROM COUNTY |
SKIA | NA | NA | FEMALE | 45 | 213 | PALMER | REMOVED | MOVED FROM COUNTY |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | MOVED FROM COUNTY |
MAGENTA | NA | NA | FEMALE | 42 | 50 | HOLLY HILL | REMOVED | MOVED FROM COUNTY |
DE | NA | NA | MALE | 105 | 1124 | FOURTH | REMOVED | MOVED FROM COUNTY |
DE DEBORAH | NA | NA | FEMALE | 105 | 1124 | FOURTH | REMOVED | MOVED FROM COUNTY |
VAN EATON | NA | NA | MALE | 58 | 811 | BRUCE | REMOVED | MOVED FROM COUNTY |
TUIT | NA | NA | MALE | 21 | 980 | DEHART COMM CENTER | REMOVED | MOVED FROM COUNTY |
MARGO | (ONLY | NA | FEMALE | 62 | 119 | HILT | REMOVED | MOVED FROM STATE |
LEWIS | BUZBY | NA | FEMALE | 42 | 8218 | HERTFORD | REMOVED | MOVED FROM STATE |
RIVERS-MITCHELL | TRINA SAGE | NA | FEMALE | 30 | 402 | KYLE | REMOVED | MOVED FROM STATE |
BURNET | UNNI KJOSNES | NA | FEMALE | 72 | 4800 | WELWYN | REMOVED | MOVED FROM STATE |
HOCUTT CLAVON MORRIS | NA | NA | MALE | 58 | 178 | ROUNTREE | REMOVED | MOVED FROM STATE |
SEXTON | NA | NA | FEMALE | 53 | 2954 | ORCHID | REMOVED | MOVED FROM STATE |
ST JOHN | NA | NA | FEMALE | 44 | 6380 | LAMSHIRE | REMOVED | MOVED FROM STATE |
REARDON | JOSEPH | SR | MALE | 53 | 7122 | TUCKASEEGEE | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
LOVE | K | NA | MALE | 81 | 426 | TRYON | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
MASTON | MELISSA CHAN | NA | FEMALE | 34 | 0 | CLARK | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
VON LOTHENHEIGER | ROBIN | NA | FEMALE | 43 | 3511 | US 117 ALT | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
HOLLOMAN | NA | R | FEMALE | 100 | 701 | HIGH | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
BOSTIAN | NA | NA | FEMALE | 0 | 0 | UNKNOWN | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
JOLLY | NA | NA | FEMALE | 98 | 1001 | CRESCENT | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
GRAHAM GARLAND | NA | SR | MALE | 74 | 10 | CAROLINA PINES MHP | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
KANTHI | NA | NA | FEMALE | 56 | 2211 | NA | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
SCHAN | NA | NA | FEMALE | 35 | 332 | LOWER GRASSY BRANCH | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
STEWART-WOODS MARY O | NA | NA | FEMALE | 53 | 218 | NA | REMOVED | REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS |
BEST | CHARLES RAY | JR | MALE | 55 | 0 | ROUTE 3 | REMOVED | REMOVED UNDER OLD PURGE LAW |
KAAS | EDWARD | FRE | MALE | 76 | 112 | 5TH | REMOVED | REMOVED UNDER OLD PURGE LAW |
DAVENPORT | H | NA | FEMALE | 98 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
BEAUDION | JOHN | NA | MALE | 50 | 162 | COUTRY CLUB | REMOVED | REMOVED UNDER OLD PURGE LAW |
GRAHAM | JOHN | NA | MALE | 85 | 0 | RURAL RTE 65 | REMOVED | REMOVED UNDER OLD PURGE LAW |
GUNTER | LEE KLEIN | NA | FEMALE | 43 | 0 | PAST LECKA’S | REMOVED | REMOVED UNDER OLD PURGE LAW |
DANIELS | MARION | NA | MALE | 90 | 126 | MILL POND | REMOVED | REMOVED UNDER OLD PURGE LAW |
WOOD | NICOLE M | NA | FEMALE | 58 | 525 | 1ST | REMOVED | REMOVED UNDER OLD PURGE LAW |
BORIS | ROBERT | NA | MALE | 63 | 9211 | HORNIGOLD | REMOVED | REMOVED UNDER OLD PURGE LAW |
MOOREFIELD | ROBERT | STA | MALE | 49 | 1505 | VILLAGE | REMOVED | REMOVED UNDER OLD PURGE LAW |
JORDAN | TERRA | NA | FEMALE | 32 | 810 | PO BOX | REMOVED | REMOVED UNDER OLD PURGE LAW |
D’AIGNEAU | TRACY | ANN | FEMALE | 37 | 1201 | SWORDFISH | REMOVED | REMOVED UNDER OLD PURGE LAW |
NCT IS WRONG. SENT | NA | NA | FEMALE | 13 | 0 | SCENIC | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
MV 5/17/95 | NA | NA | MALE | 89 | 0 | MYERS CHAPEL | REMOVED | REMOVED UNDER OLD PURGE LAW |
PRINCE ALICE KAY | NA | NA | FEMALE | 50 | 0 | BYRDSVILLE MO HO PK | REMOVED | REMOVED UNDER OLD PURGE LAW |
LSEWHERE I | NA | NA | FEMALE | 39 | 72 | RURAL RTE 1 | REMOVED | REMOVED UNDER OLD PURGE LAW |
MILES IRENE K | NA | NA | FEMALE | 100 | 400 | RURAL RTE 1 | REMOVED | REMOVED UNDER OLD PURGE LAW |
CARROLL | NA | NA | FEMALE | 58 | 20 | ELLIOTT | REMOVED | REMOVED UNDER OLD PURGE LAW |
STEPHENS JEFFRYN G | NA | NA | FEMALE | 61 | 112 | ESTES | REMOVED | REMOVED UNDER OLD PURGE LAW |
HENDERSON RAY MICH | NA | NA | MALE | 53 | 35 | DAVIE | REMOVED | REMOVED UNDER OLD PURGE LAW |
MENENDEZ-ZALACAIN | NA | NA | FEMALE | 58 | 1103 | GREENSBORO | REMOVED | REMOVED UNDER OLD PURGE LAW |
LASSITE | NA | NA | MALE | 54 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
MILLER JOHN KNOX | NA | NA | MALE | 83 | 604 | TINKERBELL | REMOVED | REMOVED UNDER OLD PURGE LAW |
DEL ROSSO FRANCES | NA | NA | FEMALE | 81 | 1407 | THE OAKS APTS | REMOVED | REMOVED UNDER OLD PURGE LAW |
MEEKER MICHAEL GAI | NA | NA | MALE | 58 | 0 | ORANGE GROVE | REMOVED | REMOVED UNDER OLD PURGE LAW |
PRICE INEZ KEETER | NA | NA | FEMALE | 72 | 0 | RURAL RTE 2 | REMOVED | REMOVED UNDER OLD PURGE LAW |
RUTT CHARLES E | NA | NA | MALE | 64 | 0 | ORANGE GROVE | REMOVED | REMOVED UNDER OLD PURGE LAW |
SUNDSTROM MARY BRE | NA | NA | FEMALE | 55 | 52 | FLINT RIDGE APTS | REMOVED | REMOVED UNDER OLD PURGE LAW |
PENDERGRAPH ADA W | NA | NA | FEMALE | 105 | 316 | RURAL RTE 2 | REMOVED | REMOVED UNDER OLD PURGE LAW |
WILKINS TERESA ELL | NA | NA | FEMALE | 50 | 0 | COUNTRY SQUIRE MO HO PK | REMOVED | REMOVED UNDER OLD PURGE LAW |
FERRETTIJ THOMAS A | NA | NA | MALE | 59 | 0 | SR 1115 | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
BRADSHAW | NA | NA | MALE | 49 | 0 | WITTY | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
XXX | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
X | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NEW TEST | NA | NA | UNK | 0 | 15 | NO | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
ARRINGTON JULI | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | 08 | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | UNKNOWN | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
N | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
0 | NA | NA | FEMALE | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
NA | NA | NA | UNK | 0 | 0 | NA | REMOVED | REMOVED UNDER OLD PURGE LAW |
Very few records are missing a first name or last name, and most of those have REMOVED status. The simplest option is to drop those records.
Exclude records with missing first or last name
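A minimal sketch of that exclusion, assuming it is done with `dplyr::filter()` on the same columns used above. The toy tibble `d_toy` and the result name `d_named` are hypothetical stand-ins for the real data frame `d`; this is not necessarily how the final analysis implements the step.

```r
library(dplyr)  # attaches %>%; functions are still namespaced as elsewhere on this page

# Toy stand-in for the voter data frame `d` (values are illustrative only)
d_toy <- dplyr::tibble(
  last_name  = c("MOORE", NA, "YABIN", NA),
  first_name = c("CAROLYN", "DAVID", NA, NA)
)

# Keep only records where both first name and last name are present
d_named <- d_toy %>%
  dplyr::filter(!is.na(first_name), !is.na(last_name))

d_named
```

Filtering on both conditions at once drops a record if either name is missing, which matches the rule as stated.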
d %>% dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "[a-z]"))
# A tibble: 50 x 1
last_name
<chr>
1 McCLURE
2 McCULLLEY
3 DeNOON
4 DeSIMON
5 DeSIMON
6 DeVANE
7 DeVANE
8 LeMASTER
9 MaCDONELL
10 MaCDONELL
# … with 40 more rows
d %>% dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "[a-z]"))
# A tibble: 24 x 1
first_name
<chr>
1 JoANN
2 LaVERNE
3 BettyJEAN
4 JoANNE
5 LaWANDA
6 LaVAN
7 JoANN
8 LaDORA
9 JoANN
10 SiROBERT
# … with 14 more rows
d %>% dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "[a-z]"))
# A tibble: 169 x 1
midl_name
<chr>
1 McBRIDE
2 McBRIDE
3 McCLENNY
4 McLEAN
5 LaVERNE
6 McCLEASE
7 McDAY
8 McCOLLUM
9 McKINNIE
10 McLAWHORN
# … with 159 more rows
Map all letters to upper case
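One way this mapping could look, assuming it is applied to all three name columns with `stringr::str_to_upper()`. The toy tibble `names_toy` is hypothetical and only reuses a few of the mixed-case values listed above.

```r
library(dplyr)  # attaches %>%

# Toy stand-in with a few of the mixed-case values seen above
names_toy <- dplyr::tibble(
  last_name  = c("McCLURE", "DeVANE", "SMITH"),
  first_name = c("JoANN", "LaVERNE", "MARY"),
  midl_name  = c("McBRIDE", NA, "ANN")
)

# Upper-case every name column; missing values stay missing
names_upper <- names_toy %>%
  dplyr::mutate(
    dplyr::across(c(last_name, first_name, midl_name), stringr::str_to_upper)
  )

names_upper
```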
d %>% dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "[0-9]"))
# A tibble: 90 x 1
last_name
<chr>
1 HOLLERS 111
2 GALL0WAY
3 MV 5/17/95
4 01
5 YARBOR0
6 J0HNSON
7 LEAK 111
8 BURT0N
9 REYN0LDS
10 4MCMANUS
# … with 80 more rows
d %>% dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "[0-9]"))
# A tibble: 81 x 1
first_name
<chr>
1 HERM0N
2 BL0SSIE
3 J0HN
4 J0HNNY
5 MAJ0R
6 J0NATHAN
7 J0SEPH
8 L0RI
9 LEPOLE0N
10 J0 ELLEN
# … with 71 more rows
d %>% dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "[0-9]"))
# A tibble: 299 x 1
midl_name
<chr>
1 MIZELLE25248249
2 0VERTON
3 111
4 RAY 1.
5 0DELL
6 OLLIE 111
7 ARGUS 4TH
8 3RD.
9 LYN451
10 JAMES 111
# … with 289 more rows
Look at the digits individually.
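The per-digit checks repeat the same steps for every digit and every name column. As a sketch only, the repetition could be folded into a small helper; `values_with_digit()` and the toy vector are hypothetical and are not part of the analysis code.

```r
# Hypothetical helper: distinct, sorted values of a character vector
# that contain the given digit (passed as a one-character string)
values_with_digit <- function(x, digit) {
  hits <- x[!is.na(x) & stringr::str_detect(x, stringr::fixed(digit))]
  sort(unique(hits))
}

# Toy values taken from the listings on this page
toy <- c("J0HNSON", "CARR 111", "SMITH", NA, "J0HNSON", "01")

# One result per digit, named "0" to "9"
by_digit <- lapply(
  setNames(nm = as.character(0:9)),
  function(dg) values_with_digit(toy, dg)
)

by_digit[["0"]]
by_digit[["1"]]
```

With the real data, the same helper would be called as, for example, `values_with_digit(d$last_name, "0")`.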
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "0"))
dim(x)
[1] 67 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "0" "0000000072294" "01"
[4] "0WENS" "ALEM0N" "AN0Y0"
[7] "BISH0P" "BOLAD0" "BURT0N"
[10] "C0NNOR" "C0STNER" "CAPUT0"
[13] "CAUSIEESTK0-LEE" "CL0NTZ" "CLEMM0NS"
[16] "CONN0R" "CONR0Y" "CR0NE"
[19] "D0LLARS" "D0WNS" "DAWY0T"
[22] "DIVINCENZ0" "EAT0N" "ESC0BEDO"
[25] "FERGUS0N" "FERNANDEZ-BRAV0" "GALL0WAY"
[28] "GOM0" "GUARDAD0" "HIGUER0-JAMES"
[31] "J0HNSON" "JOHNS0N" "JORDAN-R0BERTS"
[34] "KEAT0N" "KOCH0NEAL" "KONI0R"
[37] "L0CKLEAR" "MCC0Y" "MCD0UGAL"
[40] "ND0H" "OCONN0R" "P0RTER"
[43] "P0WERS" "PEREZ-NAVARR0" "PULL0"
[46] "R0CCANOVA" "R0CCO" "R0DRIGUEZ"
[49] "REYN0LDS" "ROSK0S-SHAMBERGER" "RUSS0"
[52] "SAMARG0" "SCAMARD0" "SIMPS0N"
[55] "SOLTER0" "SOOTO0" "ST0LTZ"
[58] "TANHEHC0" "TAYL0R" "THOMPS0N"
[61] "WINST0N" "WIT0SKY" "WO0DARD"
[64] "YARBOR0" "YATSK0"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "0"))
dim(x)
[1] 73 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
[1] "0" "ALLIS0N" "ALONZ0" "ANDREA-0" "ANTONI0"
[6] "AZAVI0US" "B0BBY" "B0NNIE" "B0YCE" "BL0SSIE"
[11] "C0LBY" "C0RDELIA" "CAR0LE" "CAR0LYN" "CHERYL0N"
[16] "CHRIST0PHER" "D0LORES" "D0NNA" "DELI0" "DONNA CAR0"
[21] "DOR0THY" "GREG0RY" "HERM0N" "J0" "J0 ANN"
[26] "J0 ELLEN" "J0AN" "J0HN" "J0HNNY" "J0NATHAN"
[31] "J0SEPH" "JONATH0N" "K0LTON" "KAR0N" "L0RI"
[36] "L0UIZETTA" "LEPOLE0N" "M0NICA" "M0NIKA" "MAJ0R"
[41] "MARI0N" "MARY-J0" "MICHAEL TR0" "NAT0SHA" "ORLAND0"
[46] "OTH0" "P0LLY" "PLACID0" "R0BERT" "R0Y"
[51] "REYNALD0" "RODRIG0" "S0NTE" "SHANN0N" "T0NYA"
[56] "TIM0THY" "V0NCIEAL" "Y0LANDA"
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "0"))
dim(x)
[1] 130 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "0" "0 CYRUS" "0'BRIAN" "0'CONNOR"
[5] "0ATES" "0DEL" "0DELL" "0MAE"
[9] "0ROURKE" "0VERTON" "10052004" "103"
[13] "205" "2205" "8017" "ALEXANDER080572"
[17] "ALPHONS0" "ANDERSON9104576" "ANN B0YD" "ANTH0NY"
[21] "AY0" "BA0-KUO" "C1010" "CO0PER"
[25] "COL0N" "CR0XIN" "D0N" "D0RIS"
[29] "D0UGLAS" "DALE401" "DEANGEL0" "DEV0NA"
[33] "DIO0NE" "DON0HOO" "EDWARDS1801" "ELAINE1000"
[37] "ELLI0TT" "EMETRIC0" "EN0" "F0REST"
[41] "FINLEY500 SU" "FRANT0NIO" "H0USTON" "J0"
[45] "J0 MARINOVIC" "J0E" "J0HN" "J0NES"
[49] "JONATH0N" "JOYCE701" "JUNI0R" "L0CKAMY"
[53] "L0UISE" "LAM0ND" "LAT0NYA" "LAV0NE"
[57] "LE0N" "LEE3708" "LORENZ0" "LOUIS7100"
[61] "LY0NS" "LYNN1820" "M00RE" "M0NGE"
[65] "M0NIQUE" "M0RALES" "MARIE103062" "NICH0LE"
[69] "NICH0LS" "OCONN0R" "ORLAND0" "P0RTER"
[73] "PESATUR0" "R0BERT" "R0CHELLE" "R0DGERS"
[77] "R0Y" "ROBINS0N" "ROSENBAUM3305" "RUNY0N"
[81] "SAMBRAN0" "SC0TT" "SCOTT3450" "SH0RROD"
[85] "T0DD" "T0NY" "TAYL0R" "TH0MPSON"
[89] "TOME0" "V0SS" "VALENTIN0" "W00LARD"
[93] "WAYNE030986" "WRIGHT2106" "Y0LONDA" "Y0UNG"
Map zero to O if name contains at least one letter and no digits 1-9
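A possible implementation of this rule, written as a sketch. The function name `map_zero_to_o()` is hypothetical, and the rule is read literally: a value is changed only if it contains at least one letter and none of the digits 1 to 9.

```r
# Hypothetical helper: replace "0" with "O" only in values that contain
# at least one letter and none of the digits 1-9
map_zero_to_o <- function(x) {
  eligible <- !is.na(x) &
    stringr::str_detect(x, "[A-Za-z]") &
    !stringr::str_detect(x, "[1-9]")
  x[eligible] <- stringr::str_replace_all(x[eligible], "0", "O")
  x
}

# Values taken from the listings above
zero_examples <- c("J0HNSON", "M00RE", "0'BRIAN", "0000000072294", "01", "DALE401", "0", NA)
zero_fixed <- map_zero_to_o(zero_examples)
zero_fixed
```

Values such as "DALE401" or "0000000072294" are left untouched under this rule, so they would still need separate handling.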
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "1"))
dim(x)
[1] 20 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "01" "1" "491715" "971"
[5] "CARR 111" "CHASTAIN 11" "CLARK 111" "COMER 111"
[9] "COX 1V" "HINES 111" "HOLLERS 111" "LATTA 111"
[13] "LEAK 111" "MELTON 111" "MV 5/17/95" "PEELE 11"
[17] "SATTERFIELD 111" "SPATCHER 111" "TUCKER 11" "WASHINGTON 111"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "1"))
dim(x)
[1] 3 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
[1] "DAVID 111" "ELIZABE1H" "ROSE1"
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "1"))
dim(x)
[1] 163 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "10052004" "103" "11" "111"
[5] "1V" "8017" "A 111" "ANDERSON9104576"
[9] "ANN155" "B 11" "B 111" "C 111"
[13] "C1010" "D 11" "DALE401" "EDWARDS1801"
[17] "ELAINE1000" "EUGENE 11" "FRANCIS 11" "FRANKLIN 1V"
[21] "H 11" "H 111" "HODGES 111" "HOUSTON 11"
[25] "HOYLE 111" "J1-TO" "JAMES 111" "JONA1"
[29] "JOYCE701" "LOUIS7100" "LYN451" "LYNN1820"
[33] "LYNN2513" "M 111" "M1" "MARIE103062"
[37] "MARION 111" "MASON 111" "MICHAEL146" "N 111"
[41] "NADINE DOUGLAS1" "OLLIE 111" "RANDOLPH 111" "RAY 1."
[45] "ROYAL 111" "T 111" "THOMAS 111" "VERNON 111"
[49] "W 111" "WILLIAM 11" "WILLIAM 111" "WILLIAM1"
[53] "WM 111" "WRIGHT2106"
Delete generation suffixes where possible
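
One possible reading of this note, offered as an assumption rather than the author's method, is that trailing tokens such as "111", "11" and "1V" are the generation suffixes III, II and IV typed with the digit 1. Under that assumption, a sketch with a hypothetical helper `strip_digit_generation_suffix` could look like this:

```r
# Assumption: a trailing " 111", " 11" or " 1V" is a mistyped III, II or IV.
# Only tokens preceded by whitespace are removed, so a bare "111" is kept.
strip_digit_generation_suffix <- function(x) {
  stringr::str_remove(x, "\\s+(111|11|1V)$")
}

suffix_demo <- strip_digit_generation_suffix(
  c("CARR 111", "COX 1V", "TUCKER 11", "MV 5/17/95", "111")
)
print(suffix_demo)
```

Values that do not end in one of these tokens, such as "MV 5/17/95", pass through unchanged and would need separate handling.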
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "2"))
dim(x)
[1] 1 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "0000000072294"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "2"))
dim(x)
[1] 1 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
[1] "MICHAEL DEAN 2"
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "2"))
dim(x)
[1] 13 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "10052004" "205" "2205" "328"
[5] "4625" "4932" "ALEXANDER080572" "B2957"
[9] "LYNN1820" "LYNN2513" "MARIE103062" "MIZELLE25248249"
[13] "WRIGHT2106"
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "3"))
dim(x)
[1] 1 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "3"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "3"))
dim(x)
[1] 0 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
character(0)
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "3"))
dim(x)
[1] 13 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "103" "328" "3RD." "4932"
[5] "LEE3708" "LYNN2513" "MACK 3RD" "MARIE103062"
[9] "MITCHELL368" "ROSENBAUM3305" "SANFORD-3" "SCOTT3450"
[13] "WAYNE030986"
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "4"))
dim(x)
[1] 3 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "0000000072294" "491715" "4MCMANUS"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "4"))
dim(x)
[1] 1 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
[1] "FR4ANK"
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "4"))
dim(x)
[1] 15 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "10052004" "4625" "4932" "ANDERSON9104576"
[5] "ANN BURTON47" "ARGUS 4TH" "DALE401" "JAM4S"
[9] "LYN451" "MCREE 4" "MICHA4EL" "MICHAEL146"
[13] "MIZELLE25248249" "SCOTT3450" "TE4S"
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "5"))
dim(x)
[1] 3 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "491715" "ALBER5TSON" "MV 5/17/95"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "5"))
dim(x)
[1] 0 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
character(0)
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "5"))
dim(x)
[1] 17 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "(NMN)5TH" "10052004" "205" "2205"
[5] "4625" "ALEXANDER080572" "ANDERSON9104576" "ANN155"
[9] "B2957" "FINLEY500 SU" "LUTHER5" "LYN451"
[13] "LYNN2513" "MIZELLE25248249" "ROSENBAUM3305" "SCOTT3450"
[17] "W5RAY"
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "6"))
dim(x)
[1] 1 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "6"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "6"))
dim(x)
[1] 1 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
[1] "RETT6A"
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "6"))
dim(x)
[1] 7 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "4625" "ANDERSON9104576" "MARIE103062" "MICHAEL146"
[5] "MITCHELL368" "WAYNE030986" "WRIGHT2106"
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "7"))
dim(x)
[1] 4 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "0000000072294" "491715" "971" "MV 5/17/95"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "7"))
dim(x)
[1] 0 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
character(0)
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "7"))
dim(x)
[1] 8 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "8017" "ALEXANDER080572" "ANDERSON9104576" "ANN BURTON47"
[5] "B2957" "JOYCE701" "LEE3708" "LOUIS7100"
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "8"))
dim(x)
[1] 0 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
character(0)
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "8"))
dim(x)
[1] 2 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
[1] "BEA LOUI8" "J8IMMIE"
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "8"))
dim(x)
[1] 9 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "328" "8017" "ALEXANDER080572" "EDWARDS1801"
[5] "LEE3708" "LYNN1820" "MITCHELL368" "MIZELLE25248249"
[9] "WAYNE030986"
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "9"))
dim(x)
[1] 4 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "0000000072294" "491715" "971" "MV 5/17/95"
x <- d %>%
dplyr::select(first_name) %>%
dplyr::filter(stringr::str_detect(first_name, "9"))
dim(x)
[1] 0 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(first_name) %>%
dplyr::pull(first_name)
character(0)
x <- d %>%
dplyr::select(midl_name) %>%
dplyr::filter(stringr::str_detect(midl_name, "9"))
dim(x)
[1] 6 1
x %>%
dplyr::distinct() %>%
dplyr::arrange(midl_name) %>%
dplyr::pull(midl_name)
[1] "4932" "ANDERSON9104576" "B2957" "LO9UIS"
[5] "MIZELLE25248249" "WAYNE030986"
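
The per-digit checks above repeat the same pipeline for each digit and each name field. A compact sketch of the same row counts is given below. The function `count_digit_rows` and the toy data frame `d_toy` are hypothetical stand-ins for illustration, with a few values copied from the output above.

```r
# Toy stand-in for the real data frame `d`
d_toy <- data.frame(
  last_name  = c("CARR 111", "SMITH", "4MCMANUS"),
  first_name = c("ROSE1", "FR4ANK", "JOHN"),
  midl_name  = c("LYNN1820", NA, "JAM4S"),
  stringsAsFactors = FALSE
)

# For each column and each digit, count the rows whose value contains that digit
count_digit_rows <- function(df, cols, digits = as.character(1:9)) {
  sapply(cols, function(col) {
    sapply(digits, function(dg) {
      sum(stringr::str_detect(df[[col]], stringr::fixed(dg)), na.rm = TRUE)
    })
  })
}

digit_counts <- count_digit_rows(d_toy, c("last_name", "first_name", "midl_name"))
print(digit_counts)
```

The result is a matrix with one row per digit and one column per name field, which corresponds to the first element of each `dim(x)` above.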
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "\\s"))
dim(x)
[1] 13637 1
x %>%
dplyr::slice_head(n = 100) %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "ABD SHAKUR" "ABD SHAKUR" "AL HUSSAINA"
[4] "ARNOLD DEW" "BENDER JR" "DA SILVA"
[7] "DA SILVA" "DA SILVA" "DE BRADY"
[10] "DEL MAURO" "DEL ROSARIO" "DES JARDINS"
[13] "DI LORENZO" "DU BOIS" "HOLLERS 111"
[16] "KROMIS BRESNIHAN" "LA MOTTE" "LAMBERT JR"
[19] "LE BLANC" "LE FEVER" "LE MAY"
[22] "MAC CRINDLE" "MAC DONALD" "MAC DOWELL"
[25] "MAC DOWELL" "MC ANIFF" "MC ANIFF"
[28] "MC CADEN" "MC CADEN" "MC CADEN"
[31] "MC CADEN" "MC CADEN" "MC CADEN"
[34] "MC CADEN" "MC COY" "MC COY"
[37] "MC COY" "MC COY" "MC CRAY"
[40] "MC GARR" "MC GHEE" "MC GHEE"
[43] "MC GHEE" "MC GUIRE" "MC MANNEN"
[46] "MC MULLEN" "MC NAIR" "MCMILLIAN (MUMFO"
[49] "MCQUEEN (MORRISE" "MILLS- KHARBAT" "NCT IS WRONG. SENT"
[52] "O BRIEN" "O HARA" "O NEAL"
[55] "O NEAL" "PARISH (RAMON)" "REDFEARN- SHELTON"
[58] "ST CLAIR" "ST CLAIR" "ST CLAIR"
[61] "ST CLAIR" "ST CLAIR" "ST CLAIR"
[64] "ST CLAIR" "ST LOUIS" "ST ONGE"
[67] "ST PIERR" "ST SING" "ST SING"
[70] "ST SING" "SYKES (BRICKHOUSE)" "TIPTON- BARNARD"
[73] "VAN BALEN" "VAN BUSKIRK" "VAN DEVENTER"
[76] "VAN DONSEL" "VAN DORPE" "VAN DYKE"
[79] "VAN DYKE" "VAN ETTEN" "VAN HORN"
[82] "VAN HORN" "VAN HORN" "VAN HORN"
[85] "VAN LOTON" "VAN MEIR" "VAN SCHOLK"
[88] "VAN SUTPHIN" "VAN ZANDLE" "VAN ZANDLE"
[91] "VANDER STOKKER" "VON BIBERSTEIN" "VON BIBERSTEIN"
[94] "VON BIBERSTEIN" "VON BIBERSTEIN" "VON BIBERSTEIN"
[97] "WATTS ST PIERREE" "WHITFIELD KAY M" "YELLOW ROBE"
[100] "YELLOW ROBE"
Map whitespace to empty string
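
A minimal sketch of this mapping, assuming it is applied to the whole name with stringr (the helper name `remove_whitespace` is hypothetical):

```r
# Remove every whitespace character from a name
remove_whitespace <- function(x) {
  stringr::str_remove_all(x, "\\s")
}

ws_demo <- remove_whitespace(c("MC COY", "MCCOY", "ST CLAIR", "MILLS- KHARBAT"))
print(ws_demo)
```

After the mapping, spelling variants such as "MC COY" and "MCCOY" become the same string.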
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "-"))
dim(x)
[1] 34325 1
x %>%
dplyr::slice_head(n = 100) %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "AB-HUGH" "AB-HUGH" "ABDUL-GHAFFAR"
[4] "ABDUL-GHAFFAR" "ABDUL-KARRIEM" "ABDUL-RABB"
[7] "ABDUL-RAHIM" "ABDUL-RAHIN" "ABDUL-RAHMAN"
[10] "ABDUL-SALAAM" "ABDUL-SALAM" "ABDUL-WAHID"
[13] "ABDUR-RAHIM" "ABDUR-RAHMAN" "ABU-DAMES"
[16] "ABU-SABA" "ABU-SABA" "ABU-SABA"
[19] "ADAMS-CASKIE" "ADAMS-MYERS" "AFRICA-FLOYD"
[22] "AL-AWAR" "AL-AWAR" "AL-AWAR"
[25] "AL-KURDI" "AL-SAADI" "AL-SAADI"
[28] "ALBERT-KEULAN" "ALSTON-EATMON" "ANDERSON-TESH"
[31] "APPLEWHITE-LEWIS" "ARDITO-BARLETTA" "ARMSTRONG-VANN"
[34] "ARTHUR-CORNETT" "ASKINS-MYRICK" "AWTREY-KIRKMAN"
[37] "BAILEY-BROOKS" "BARNARD-BAILEY" "BENNETT-CLOWNEY"
[40] "BENTLEY-HALE" "BIBB-FREEMAN" "BLAKE-HASKINS"
[43] "BLEKFELD-SZTRAKY" "BLEVINS-SPRINKLE" "BLUE-SWANN"
[46] "BRADY-WILSON" "BROWN-CORNELIUS" "BRUCE-ROSS"
[49] "BUCKLEY-MOORE" "CLARK-BARKER" "CLAUDIO-DIAZ"
[52] "CLAUDIO-DIAZ" "CLAUDIO-DIAZ" "CLAUDIO-DIAZ"
[55] "COLE-MORGAN" "CROWELL-SMITH" "DAVIS-BOYD"
[58] "DAVIS-PARKER" "DAVIS-ROBINSON" "DUFFER-LEECHFORD"
[61] "EATON-ALSTON" "ELLIS-WALLACE" "ENGEL-BAKER"
[64] "GILLIS-HENDELL" "GORDON-WICKER" "GREEN-HOLLEY"
[67] "GUPTA-THOMAS" "HARGETT-LILLY" "HIATT-CRIBBS"
[70] "JONES-ALEXANDER" "JONES-SUTTON" "KELLER-HULL"
[73] "KOSKI-PONTON" "KUCERA-HOFFMANN" "LAWS-GRIFFIN"
[76] "LEARY-SMITH" "LIDE-GRANT" "LITTON-MCKENZIE"
[79] "LOCKLEAR-CASEY" "LOCKLEAR-CRABTREE" "MANESS-LITTLE"
[82] "MAYNOR-BOWEN" "MILLS- KHARBAT" "MURPHY-GRAY"
[85] "PARKER-LOWE" "PARRA-ASH" "POOLE-JENKINS"
[88] "POPISH-SMITH" "RAY-LEAZER" "REDFEARN- SHELTON"
[91] "RIDDICK-HARRELL" "RIVERA-MONTORO" "SEVORES-AMMONS"
[94] "SORRELLS-COOPER" "STEPHENS-HORTON" "TIPTON- BARNARD"
[97] "TOMBLIN-WELLMAN" "WALLIS-JOHNSON" "WATKINS-AKERS"
[100] "WHITAKER-LINDSAY"
Map hyphen to empty string
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "'"))
dim(x)
[1] 9712 1
x %>%
dplyr::distinct() %>%
dplyr::slice_head(n = 100) %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] "BOURR'E" "BOVE'" "D'ALPHE"
[4] "D'AMBROSIO" "D'AMICO" "D'ANGELO"
[7] "D'ANGIO" "D'ANNUNZIO" "D'ANTIGNAC"
[10] "D'ARCO" "D'ARMOND" "D'ARVILLE"
[13] "D'ASCOLI" "D'AUGUSTA" "D'AURIA"
[16] "D'AUTRECHY" "D'AVANZO" "D'EMPAIRE"
[19] "D'ERCOLE" "D'HEMECOURT" "D'IGNAZIO"
[22] "D'INDIA" "D'ONOFRIO" "D'SANT"
[25] "DEBELL-O'NEAL" "DEL RE'" "DELL'OSSO"
[28] "DUARTE'" "L'ETOILE" "L'HUILLIER"
[31] "LACHARITE'-OTWELL" "O' NEAL" "O'BANION"
[34] "O'BANNON" "O'BERRY" "O'BRIAN"
[37] "O'BRIANT" "O'BRIEN" "O'BRYAN"
[40] "O'BRYANT" "O'BRYON" "O'BYRNE"
[43] "O'CARROLL" "O'CONNEL" "O'CONNELL"
[46] "O'CONNER" "O'CONNOR" "O'CONWELL"
[49] "O'DANIEL" "O'DEA" "O'DEAR"
[52] "O'DEAR BROOKS" "O'DELL" "O'DOM"
[55] "O'DONALD" "O'DONNEL" "O'DONNELL"
[58] "O'DRISCOLL" "O'FARRELL" "O'FERRELL"
[61] "O'GARA" "O'GEARY" "O'GRADY"
[64] "O'GUIN" "O'GWYNN" "O'HARA"
[67] "O'HERN" "O'KANE" "O'KEEFE"
[70] "O'KELLEY" "O'KELLY" "O'KONEK"
[73] "O'LAUGHLIN" "O'LEARY" "O'MAHONY"
[76] "O'MARA" "O'NEAL" "O'NEAL-BIGGS"
[79] "O'NEAL-CLEMENTS" "O'NEAL-WRIGHT" "O'NEIL"
[82] "O'NEILL" "O'PHARROW" "O'QUIN"
[85] "O'QUINN" "O'REAR" "O'REILLY"
[88] "O'RILEY" "O'RORK" "O'ROUKE"
[91] "O'ROURKE" "O'SHAUGHNESSY" "O'SHEA"
[94] "O'SHIELD" "O'SHIELDS" "O'STEEN"
[97] "O'SULLIVAN" "O'TOOLE" "O'TUEL"
[100] "SOLLE'"
Map single quote to empty string
d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "\""))
# A tibble: 1 x 1
last_name
<chr>
1 "LA\"BEE"
Map all double quotes to single quotes
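
The hyphen and quote notes above could be combined as sketched here. The order is an assumption: double quotes are first mapped to single quotes, then hyphens and single quotes are removed. The helper name `normalise_hyphen_quotes` is hypothetical.

```r
normalise_hyphen_quotes <- function(x) {
  # Double quote -> single quote
  x <- stringr::str_replace_all(x, "\"", "'")
  # Hyphen and single quote -> empty string
  stringr::str_remove_all(x, "[-']")
}

quote_demo <- normalise_hyphen_quotes(c("O'NEAL-BIGGS", "LA\"BEE", "ABDUL-RAHMAN"))
print(quote_demo)
```

With this order the net effect on a double quote is removal, so "LA\"BEE" and "LA'BEE" end up as the same value.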
x <- d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "[^-\\s'\"a-zA-Z]"))
dim(x)
[1] 294 1
x %>%
dplyr::distinct() %>%
dplyr::slice_head(n = 100) %>%
dplyr::arrange(last_name) %>%
dplyr::pull(last_name)
[1] ";PGEMAN" "01" "3"
[4] "491715" "4MCMANUS" "AMATO,KATHERINE,M"
[7] "AMIDON,PETER,LEVENT" "AN0Y0" "BAKER (MCFADYEN)"
[10] "BAREFOOT (RHINE)" "BELL,MITCHELL THOMAS" "BEST,SYDNEY,ALLISON"
[13] "BINGHAM JR." "BOYD,ALLEN AUBREY,II" "BRICE.MICHAEL ARTHUR"
[16] "BRINKLEY/BAGGS" "BRITTAIN/SPRINKLE" "BROWN,FREDERIC CHEST"
[19] "BROWN,ROBERT EDWARD," "BUCHANAN,SAMMY JOE,J" "BUNTON,RAYMOND AVNEY"
[22] "BURGESS,WINFRED LEE," "BURNETTE,TOMMY WILLI" "BURT0N"
[25] "BURWELL JR." "BYRD/ROBERTS" "CARR 111"
[28] "CARR,WENDELL,H JR" "CARSON (WADE)" "CASH/GODWIN"
[31] "CATHEY,LONNIE,JR" "CHASTAIN 11" "COLLINS (SISTER)"
[34] "COMER 111" "COTHERN (BLAKE)" "COX 1V"
[37] "EDENS (ARCHAMBAU" "EVANS (ABBOTT)" "FEE (SISTER)"
[40] "FORTNER,II" "FOSTER (KING)" "GALL0WAY"
[43] "GARNER/MCGRAW" "HINES 111" "HOLLERS 111"
[46] "HUDSON (HALL)" "J0HNSON" "JORDAN-R0BERTS"
[49] "KEAT0N" "KINLAW (GUIN)" "LAIL/OXENTINE"
[52] "LEAK 111" "LYTLE/FORNEY" "MCC0Y"
[55] "MCCLAIN (SISTER)" "MCDONOUGH (SISTER)" "MCMILLIAN (MUMFO"
[58] "MCQUEEN (MORRISE" "MOCCIA (SMITH)" "MOORING,MOLLY"
[61] "MORRIS/BLOOM" "MV 5/17/95" "NCT IS WRONG. SENT"
[64] "NICHOLS (NORTON)" "NICHOLS/BROWN" "O;NEAL"
[67] "O`BRIANT" "PALMER(BRIGGS)" "PARISH (RAMON)"
[70] "PUCKETT`" "RAMSEY/DOBERT" "REYN0LDS"
[73] "RHONEY/PETERS" "RIDGWAY;" "ROGERS,JR."
[76] "SIDI/HIDA" "SIMPS0N" "SMELT/PEARSON"
[79] "SMITH/COOPER" "SPATCHER 111" "ST. CLAIR"
[82] "ST. DENIS" "ST. GEORGE" "ST. LAWRENCE"
[85] "ST.CLAIR" "ST.GEORGE" "ST.GERMAINE"
[88] "STUTLER/JAGGERS" "SWYGERT/SMITH" "SYKES (BRICKHOUSE)"
[91] "TRIVETTE JR." "TUCKER 11" "VALKENAAR ."
[94] "WATERS/CRUZ" "WEATHERINGTON,III" "WILSON JR."
[97] "WO0DARD" "WOODARD/YANTES" "WOODARD`"
[100] "YARBOR0"
Look at those in more detail.
d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "\\."))
# A tibble: 44 x 1
last_name
<chr>
1 NCT IS WRONG. SENT
2 ROGERS,JR.
3 VALKENAAR .
4 ST.GEORGE
5 ST. GEORGE
6 BINGHAM JR.
7 ST. LAWRENCE
8 WILSON JR.
9 TRIVETTE JR.
10 ST. CLAIR
# … with 34 more rows
Some of these values include a suffix that belongs in the name_sufx_cd field.
Map period to empty string
Move suffix to suffix field
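
Moving a suffix out of the name could be sketched as below. The set of suffixes (JR, SR, II, III, IV), the separator rule and the helper name `split_suffix` are assumptions made for illustration; the real data contain more variants.

```r
# Split a trailing suffix (after whitespace or a comma, optional final period)
# from the last name. Returns a data frame with the cleaned name and the suffix.
split_suffix <- function(last_name) {
  pattern <- "[\\s,]+(JR|SR|III|II|IV)\\.?$"
  data.frame(
    last_name = stringr::str_remove(last_name, pattern),
    suffix    = stringr::str_match(last_name, pattern)[, 2],
    stringsAsFactors = FALSE
  )
}

split_demo <- split_suffix(
  c("ROGERS,JR.", "BINGHAM JR.", "WEATHERINGTON,III", "ST. CLAIR")
)
print(split_demo)
```

In the real data the extracted suffix would presumably only be written to `name_sufx_cd` when that field is missing, so that an existing suffix is not overwritten.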
d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, ","))
# A tibble: 63 x 1
last_name
<chr>
1 AMATO,KATHERINE,M
2 AMIDON,PETER,LEVENT
3 ROGERS,JR.
4 BELL,MITCHELL THOMAS
5 BEST,SYDNEY,ALLISON
6 WEATHERINGTON,III
7 BOYD,ALLEN AUBREY,II
8 BROWN,FREDERIC CHEST
9 FORTNER,II
10 BROWN,ROBERT EDWARD,
# … with 53 more rows
last_name
Map comma to empty string
Move suffix to suffix field
d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "\\*"))
# A tibble: 7 x 1
last_name
<chr>
1 O*TOOLE
2 O*TOOLE
3 O*NEAL
4 O*MASTERS
5 D*AMICO
6 D*AMICO
7 O*BRIEN
Map asterisk to empty string
d %>%
dplyr::select(last_name, first_name, sex) %>%
dplyr::filter(stringr::str_detect(last_name, "/"))
# A tibble: 46 x 3
last_name first_name sex
<chr> <chr> <chr>
1 GARNER/MCGRAW JOANN FEMALE
2 RHONEY/PETERS DONNA FEMALE
3 MV 5/17/95 <NA> MALE
4 SIDI/HIDA DEBORAH FEMALE
5 STUTLER/JAGGERS MELANIE FEMALE
6 MORRIS/BLOOM TERESA FEMALE
7 BRINKLEY/BAGGS MICHELLE FEMALE
8 RAMSEY/DOBERT AMY FEMALE
9 WATERS/CRUZ ELIZABETH FEMALE
10 BRITTAIN/SPRINKLE SANDRA FEMALE
# … with 36 more rows
Map slash to empty string
d %>%
dplyr::select(last_name, first_name, sex) %>%
dplyr::filter(stringr::str_detect(last_name, "\\\\"))
# A tibble: 4 x 3
last_name first_name sex
<chr> <chr> <chr>
1 "PUTNAM\\" TAMARA FEMALE
2 "STRTHEIT\\" LOLA FEMALE
3 "BUFFKIN\\" WESLEY MALE
4 "GOSHEN\\" DIXIE FEMALE
Map backslash to empty string
d %>%
dplyr::select(last_name, first_name, sex) %>%
dplyr::filter(stringr::str_detect(last_name, "`"))
# A tibble: 10 x 3
last_name first_name sex
<chr> <chr> <chr>
1 O`BRIANT DIANE FEMALE
2 O`BRIANT WILLIAM MALE
3 WOODARD` JASON MALE
4 PUCKETT` LEANDRA FEMALE
5 BRYANT` WILLIAM MALE
6 GODWIN` PATRICIA FEMALE
7 MORRISON` HAZEL FEMALE
8 BOYLES` LINDA FEMALE
9 HARRISON` TRACI FEMALE
10 CASEY` LONNIE MALE
Map back-tick to empty string
d %>%
dplyr::select(last_name, first_name, sex) %>%
dplyr::filter(stringr::str_detect(last_name, "~"))
# A tibble: 1 x 3
last_name first_name sex
<chr> <chr> <chr>
1 O~CONNOR-LEWIS BELINDA FEMALE
Map tilde to empty string
d %>%
dplyr::select(last_name, first_name, sex) %>%
dplyr::filter(stringr::str_detect(last_name, "_"))
# A tibble: 1 x 3
last_name first_name sex
<chr> <chr> <chr>
1 SOLARZ_VOJDANI JENNIFER FEMALE
Map underscore to empty string
d %>%
dplyr::select(last_name, first_name, sex) %>%
dplyr::filter(stringr::str_detect(last_name, "%"))
# A tibble: 1 x 3
last_name first_name sex
<chr> <chr> <chr>
1 SCHERM%MARTIN WYATT FEMALE
Map percent to empty string
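
The punctuation mappings noted above (period, comma, asterisk, slash, backslash, back-tick, tilde, underscore, percent) could be applied in one step with a character class. This is a sketch under that assumption; the helper name `remove_punctuation` is hypothetical.

```r
# Remove . , * / \ ` ~ _ % from a name in a single pass
remove_punctuation <- function(x) {
  stringr::str_remove_all(x, "[\\.,\\*/\\\\`~_%]")
}

punct_demo <- remove_punctuation(
  c("O*TOOLE", "PUTNAM\\", "O`BRIANT", "SCHERM%MARTIN", "GARNER/MCGRAW", "ST. CLAIR")
)
print(punct_demo)
```

Whitespace is not part of this class, so "ST. CLAIR" becomes "ST CLAIR" and still depends on the separate whitespace mapping.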
d %>%
dplyr::select(last_name) %>%
dplyr::filter(stringr::str_detect(last_name, "[^-\\s'\"a-zA-Z015\\.,\\*/\\\\`~_%]"))
# A tibble: 37 x 1
last_name
<chr>
1 ;PGEMAN
2 MCMILLIAN (MUMFO
3 MCQUEEN (MORRISE
4 SYKES (BRICKHOUSE)
5 PARISH (RAMON)
6 BAREFOOT (RHINE)
7 MV 5/17/95
8 HUDSON (HALL)
9 FOSTER (KING)
10 KINLAW (GUIN)
# … with 27 more rows
UP TO HERE
Look at those in more detail.
Look at frequencies of names.
d %>%
dplyr::select(last_name) %>%
dplyr::count(last_name, sort = TRUE)
# A tibble: 269,313 x 2
last_name n
<chr> <int>
1 SMITH 105215
2 WILLIAMS 74940
3 JOHNSON 70103
4 JONES 69712
5 BROWN 57265
6 DAVIS 54381
7 MOORE 40564
8 MILLER 36539
9 WILSON 35738
10 TAYLOR 33519
# … with 269,303 more rows
name_sufx_cd: Voter name suffix
d %>% dplyr::select(name_sufx_cd) %>% skimr::skim()
Name | Piped data |
Number of rows | 8003293 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name_sufx_cd | 7561920 | 0.06 | 1 | 3 | 0 | 222 | 0 |
table(d$name_sufx_cd, useNA = "ifany")
? ' (GE (II (JR (SR \\ ` 0 040
2 2 1 1 4 1 2 20 3 1
070 072 08 1 106 11 111 134 15 181
1 1 1 7 1 101 241 1 1 1
1V 2 2ND 3 346 39 3RD 5 5TH 77
5 4 1 1 1 1 14 1 2 1
8 8TH 9 A AJR AKB ALB ALM ANN ARK
1 1 1 1 1 1 1 1 6 1
ART ARV B BAL BAS BAU BEA BEL BEN BOU
1 1 6 1 1 1 2 1 1 1
BRA BRI BRO BUC BUN C C. CAM CHA CLA
1 3 1 1 1 10 1 1 1 1
COY CRA CUB CUM CUT D DAN DAV DIC DIG
2 1 1 1 1 6 3 1 1 1
DO DOR DOU DOV DR DR. E EDW ELE ELI
3 1 1 1 1 4 5 1 1 1
ELS ETT EWA EY F F M FAU FOR FRE G
1 1 1 1 7 1 1 2 2 4
GLE GRE GUY H HAM HIL HOG HOO HUS I
1 1 1 3 1 1 1 2 1 566
II II. III IIL ILI IN ING IRM ITH IV
26023 3 56928 1 1 2 1 1 1 6955
IV. IX J JAC JAM JD JEN JOH JON JOS
2 1 17 1 1 4 1 1 1 2
jr JR JR, Jr. JR. K KAP KEN KIN KIT
1 295262 1 2 2832 4 1 1 1 1
L LAR LEE LEN LES LEW LIN LL LLL LOC
8 1 2 1 1 1 1 3 2 1
LOU LYN M M D MAC MAE MAT MCK MCQ MCR
2 1 11 1 1 1 1 1 1 1
MD MMO MOO MOR MR MR. MRS MS MS. MUR
6 1 1 1 11 17 123 6 18 1
N NGT NOC NON NOR NS O O'S OD OLI
3 1 1 1 1 1 2 1 2 1
ON ONG OV P PAU PET PHE PIL PLA POP
1 1 1 2 1 1 1 1 1 1
Q R RAY REB REE REV ROB ROD ROY S
3 10 1 1 1 10 2 1 1 5
SAM SCO SMI SOR sr SR Sr. SR. STA STE
1 2 1 1 1 50917 3 562 2 1
SUE SUM SWA T TA TOB TWA UNK V VAN
1 1 1 2 1 1 1 1 345 1
VER VI VII VIR VOS W WAL WAR WIL WOL
1 44 14 1 1 7 1 1 2 1
X Y <NA>
1 1 7561920
The aggregated cleaning suggestions are:
Issue | last_name | first_name | midl_name | Action |
---|---|---|---|---|
Missing | 122 | 254 | 553,015 | Exclude record if first or last name missing |
Lower case letters | 50 | 24 | 169 | Map all letters to upper case |
Digits | 90 | 81 | 299 | Map digits to empty string if not otherwise mapped |
Zero | 67 | 73 | 130 | Map zero to O if name contains at least one letter and no digits 1-9 |
One | 20 | 3 | 163 | |
Two | 1 | 1 | 13 | |
Three | 1 | 0 | 13 | |
Four | 3 | 1 | 15 | |
Five | 3 | 0 | 17 | |
Six | 1 | 1 | 7 | |
Seven | 4 | 0 | 8 | |
Eight | 0 | 2 | 9 | |
Nine | 4 | 0 | 6 | |
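
Three of the actions in the table (exclude records with a missing first or last name, map letters to upper case, map remaining digits to the empty string) could be sketched as follows. The helper names and the toy data frame are hypothetical, and the digit removal is assumed to run only after any more specific digit mapping, such as the zero rule, has been applied.

```r
# Toy stand-in for the real data frame `d`
names_toy <- data.frame(
  last_name  = c("Smith", "LEE3708", NA, "JONES"),
  first_name = c("ann", "ROSE1", "JOHN", NA),
  stringsAsFactors = FALSE
)

# Upper-case a name, then drop any digits that are still present
upper_and_strip_digits <- function(x) {
  stringr::str_remove_all(stringr::str_to_upper(x), "[0-9]")
}

# Exclude records where first or last name is missing, then clean both fields
keep <- !is.na(names_toy$last_name) & !is.na(names_toy$first_name)
names_clean <- names_toy[keep, , drop = FALSE]
names_clean$last_name  <- upper_and_strip_digits(names_clean$last_name)
names_clean$first_name <- upper_and_strip_digits(names_clean$first_name)
print(names_clean)
```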
knitr::knit_exit()