Last updated: 2021-03-28

Checks: 6 1

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20201104)

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: ef1be3d

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ef1be3d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    _targets/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/blk_char.fst
    Ignored:    output/ent_blk.fst
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/local/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  analysis/m_01_1_read_raw_entity_data.Rmd
    Untracked:  analysis/m_01_2_exclusions.Rmd
    Untracked:  analysis/m_01_3_parse_dates.Rmd

Unstaged changes:
    Modified:   R/functions.R
    Modified:   _targets.R
    Modified:   analysis/index.Rmd
    Deleted:    analysis/m_01_1_get_raw_entity_data.Rmd
    Deleted:    analysis/m_01_2_parse_dates.Rmd
    Modified:   renv.lock

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.

# Set up the project environment, because {workflowr} knits each Rmd file 
# in a new R session, and doesn't execute the project .Rprofile

library(targets) # access data from the targets cache

library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
library(fs) # file system operations
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
library(stringr) # string matching

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)

1 Introduction

The aim of this set of meta notebooks is to work out how to read the raw entity data. and get it sufficiently neatened so that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.

This notebook documents the process of excluding data rows that are not useful for this project.

The subsequent notebooks in this set will check that all the columns have been read correctly and work out how to fix them, if necessary.

1.1 Exclusions

There are four variables dealing with voter status Preliminary examination of these variables shows that some records correspond to people who have been removed from the electoral roll. This project focuses on ambiguity arising from the fact that some names are common Therefore, we want the entity data to be as accurate as possible and free of duplicate records so that we don’t introduce ambiguity because of data quality issues.

Speaking of data quality issues, it is my experience that large databases often contain test records.

I will exclude all the data rows that have any of these data quality issues. I will do this early in the pipeline to minimise the number of records processed and to avoid including these records in the subsequent quality analyses.

2 Read entity data

Read the raw entity data file using the previously defined function, raw_entity_data_read().

# Show the data file name
fs::path_file(f_entity_raw_tsv)

[1] "VR_20051125.txt.xz"

d <- raw_entity_data_read(f_entity_raw_tsv)

3 Voter status variables

Check the internal consistency of the voter status variables.

The data dictionary describes these as two pairs of variables, where each pair consists of a code variable and a label variable. I would expect the code and label values to be in a 1:1 relationship.

3.1 Status

d %>% 
  dplyr::distinct(status_cd, voter_status_desc) %>% 
  dplyr::arrange(status_cd, voter_status_desc) %>% 
  knitr::kable()

status_cd	voter_status_desc
A	ACTIVE
D	DENIED
I	INACTIVE
R	REMOVED
S	TEMPORARY REGISTRATION
NA	NA

The code and label values are in a 1:1 relationship. Use the label variable (voter_status_desc) because it is more meaningful.

Look at the distribution of values across the data.

table(Label = d$voter_status_desc, useNA = "always") %>% knitr::kable()

Label	Freq
ACTIVE	4914521
DENIED	41348
INACTIVE	495603
REMOVED	2546485
TEMPORARY REGISTRATION	5334
NA	2

~60% of records are “ACTIVE”. All the other values have negative connotations.

3.2 Reason

d %>% 
  dplyr::distinct(reason_cd, voter_status_reason_desc) %>% 
  dplyr::arrange(reason_cd, voter_status_reason_desc) %>% 
  knitr::kable()

reason_cd	voter_status_reason_desc
A1	UNVERIFIED
A2	CONFIRMATION PENDING
AA	ARMED FORCES
AL	LEGACY DATA
AN	UNVERIFIED NEW
AP	VERIFICATION PENDING
AV	VERIFIED
DI	UNAVAILABLE ESSENTIAL INFORMATION
DU	VERIFICATION RETURNED UNDELIVERABLE
IL	LEGACY - CONVERSION
IN	CONFIRMATION NOT RETURNED
IU	CONFIRMATION RETURNED UNDELIVERABLE
R2	DUPLICATE
RA	ADMINISTRATIVE
RC	REMOVED DUE TO SUSTAINED CHALLENGE
RD	DECEASED
RF	FELONY CONVICTION
RL	MOVED FROM COUNTY
RM	REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
RP	REMOVED UNDER OLD PURGE LAW
RQ	REQUEST FROM VOTER
RS	MOVED FROM STATE
RT	TEMPORARY REGISTRANT
SM	MILITARY
SO	OVERSEAS CITIZEN
SP	PREVIOUSLY REGISTERED
NA	NA

The code and label values are in a 1:1 relationship. Use the label variable (voter_status_reason_desc) because it is more meaningful.

Look at the distribution of values across the data.

table(Label = d$voter_status_reason_desc, useNA = "always") %>% knitr::kable()

Label	Freq
ADMINISTRATIVE	59008
ARMED FORCES	50
CONFIRMATION NOT RETURNED	181320
CONFIRMATION PENDING	71296
CONFIRMATION RETURNED UNDELIVERABLE	303197
DECEASED	443486
DUPLICATE	78951
FELONY CONVICTION	63501
LEGACY - CONVERSION	10585
LEGACY DATA	523899
MILITARY	3975
MOVED FROM COUNTY	888056
MOVED FROM STATE	89049
OVERSEAS CITIZEN	1307
PREVIOUSLY REGISTERED	51
REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS	551073
REMOVED DUE TO SUSTAINED CHALLENGE	662
REMOVED UNDER OLD PURGE LAW	367511
REQUEST FROM VOTER	4194
TEMPORARY REGISTRANT	729
UNAVAILABLE ESSENTIAL INFORMATION	6991
UNVERIFIED	13737
UNVERIFIED NEW	7517
VERIFICATION PENDING	198333
VERIFICATION RETURNED UNDELIVERABLE	34357
VERIFIED	4100220
NA	238

~50% of records are “VERIFIED”. That value seems to have the most positive connotations.

3.3 Status \(\times\) Reason

Look at the combinations of the two variables.

d %>% 
  with(table(voter_status_reason_desc, voter_status_desc, useNA = "always")) %>% 
  knitr::kable()

	ACTIVE	DENIED	INACTIVE	REMOVED	TEMPORARY REGISTRATION	NA
ADMINISTRATIVE	0	0	0	59008	0	0
ARMED FORCES	50	0	0	0	0	0
CONFIRMATION NOT RETURNED	0	0	181320	0	0	0
CONFIRMATION PENDING	71295	0	0	1	0	0
CONFIRMATION RETURNED UNDELIVERABLE	0	0	303197	0	0	0
DECEASED	0	0	0	443486	0	0
DUPLICATE	0	0	0	78951	0	0
FELONY CONVICTION	0	0	0	63501	0	0
LEGACY - CONVERSION	1	0	10584	0	0	0
LEGACY DATA	523897	0	2	0	0	0
MILITARY	0	0	0	0	3975	0
MOVED FROM COUNTY	0	0	0	888055	0	1
MOVED FROM STATE	0	0	0	89049	0	0
OVERSEAS CITIZEN	0	0	0	0	1307	0
PREVIOUSLY REGISTERED	0	0	0	1	50	0
REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS	0	0	0	551072	0	1
REMOVED DUE TO SUSTAINED CHALLENGE	0	0	0	662	0	0
REMOVED UNDER OLD PURGE LAW	0	0	0	367511	0	0
REQUEST FROM VOTER	0	0	0	4194	0	0
TEMPORARY REGISTRANT	0	0	0	729	0	0
UNAVAILABLE ESSENTIAL INFORMATION	0	6990	0	1	0	0
UNVERIFIED	13731	0	0	4	2	0
UNVERIFIED NEW	7516	0	0	1	0	0
VERIFICATION PENDING	198331	0	1	1	0	0
VERIFICATION RETURNED UNDELIVERABLE	0	34357	0	0	0	0
VERIFIED	4099700	1	499	20	0	0
NA	0	0	0	238	0	0

~50% are “ACTIVE” and “VERIFIED”

On a common-sense interpretation of the labels, these are the records that have survived the registration checking process, so are most likely free of errors and duplicates.

Write a function to filter the records to keep only those that are “ACTIVE” and “VERIFIED”.

# Function to exclude records based on voter status
raw_entity_data_excl_status <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::filter(
      voter_status_desc == "ACTIVE" & voter_status_reason_desc == "VERIFIED"
    )
}

Apply the filter before moving on to the next exclusion condition and track the number of rows before and after the filter.

# number of rows before filtering
nrow(d)

[1] 8003293

d <- d %>% raw_entity_data_excl_status()

# number of rows before filtering
nrow(d)

[1] 4099700

4 Test cases

Look for any records that contain the “test” as a word (i.e. enclosed by word boundaries).

d %>% 
  dplyr::select(ends_with("_name"), res_city_desc, area_cd, phone_num) %>% 
  dplyr::filter(
    stringr::str_detect(last_name, regex("\\btest\\b", ignore_case = TRUE)) |
    stringr::str_detect(first_name, regex("\\btest\\b", ignore_case = TRUE)) |
    stringr::str_detect(midl_name, regex("\\btest\\b", ignore_case = TRUE)) |
    stringr::str_detect(street_name, regex("\\btest\\b", ignore_case = TRUE))
  ) %>% 
  knitr::kable()

last_name	first_name	midl_name	street_name	res_city_desc	area_cd	phone_num
TEAGUE	NORMA	JEAN	TEST FARM	WAYNESVILLE	NA	NA
TEST	KAY	ANN	ELM	WELDON	NA	NA
TEST	FREDERICK	HAROLD	HENDERSONVILLE	ASHEVILLE	NA	NA
PHILLIPS	DEBORAH	MOORE	TEST FARM	WALLACE	910	2859781
PHILLIPS	MICHAEL	DOWNING	TEST FARM	WALLACE	910	2859781
SOLES	SHANNON	ACACIA	TEST FARM	WALLACE	910	2852575
TEST	DAVID	W	GREENWAY	CHARLOTTE	NA	NA
TEST	DANIEL	W	HICKORY KNOLL	GREENSBORO	NA	NA
TEST	GEORGE	A	ROSWELL	CHARLOTTE	NA	NA
TEST	DORIS	L	ROSWELL	CHARLOTTE	NA	NA
TEST	KATHERINE	COOKE	GREENWAY	CHARLOTTE	NA	NA
TEST	KENNETH	FARREL	WAVERLY	CLAYTON	NA	NA
TEST	THIS	NA	HIGH POINT	JAMESTOWN	NA	NA

Online search finds that “Test” is a known last name in the USA.
Online search finds that “Test Farm Road” is a known address in North Carolina.
The name “TEST, This” looks like a test case (with ACTIVE VERIFIED status).

Write a function to filter the records to exclude those with name “TEST, This”.

There are likely to be many other test cases not detected by this filter. However, they are likely to be only a tiny fraction of the records - so it’s not a big problem if they are missed. If I find any other test cases later I will come back here and revise the exclusion criteria.

# Function to exclude test records based on names
raw_entity_data_excl_test <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::filter(
      ! (
        stringr::str_detect(last_name, regex("\\bTEST\\b", ignore_case = TRUE)) &
          stringr::str_detect(first_name, regex("\\bTHIS\\b", ignore_case = TRUE))
      )
    )
}

Apply the filter and track the number of rows before and after the filter.

# number of rows before filtering
nrow(d)

[1] 4099700

d <- d %>% raw_entity_data_excl_test()

# number of rows before filtering
nrow(d)

[1] 4099699

Timing

Computation time (excl. render): 65.754 sec elapsed

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] stringr_1.4.0   tibble_3.1.0    vroom_1.4.0     fs_1.5.0       
[5] tictoc_1.0      here_1.0.1      workflowr_1.6.2 targets_0.3.0  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.22         bslib_0.2.4       purrr_0.3.4      
 [5] vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 yaml_2.2.1       
 [9] utf8_1.2.1        rlang_0.4.10      later_1.1.0.1     pillar_1.5.1     
[13] jquerylib_0.1.3   DBI_1.1.1         glue_1.4.2        withr_2.4.1      
[17] bit64_4.0.5       lifecycle_1.0.0   codetools_0.2-18  evaluate_0.14    
[21] knitr_1.31        callr_3.5.1       httpuv_1.5.5      ps_1.6.0         
[25] parallel_4.0.3    fansi_0.4.2       highr_0.8         Rcpp_1.0.6       
[29] renv_0.13.1       promises_1.2.0.1  jsonlite_1.7.2    bit_4.0.4        
[33] digest_0.6.27     stringi_1.5.3     bookdown_0.21     processx_3.5.0   
[37] dplyr_1.0.5       rprojroot_2.0.2   cli_2.3.1         tools_4.0.3      
[41] magrittr_2.0.1    sass_0.3.1        crayon_1.4.1      pkgconfig_2.0.3  
[45] ellipsis_0.3.1    data.table_1.14.0 assertthat_0.2.1  rmarkdown_2.7    
[49] R6_2.5.0          igraph_1.2.6      compiler_4.0.3    git2r_0.28.0

[meta] Apply row exclusions

m_01_2_exclusions

Ross Gayler

2021-03-28