Last updated: 2021-03-28

Checks: 6 1

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version ef1be3d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    _targets/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/blk_char.fst
    Ignored:    output/ent_blk.fst
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/local/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  analysis/m_01_1_read_raw_entity_data.Rmd
    Untracked:  analysis/m_01_2_exclusions.Rmd
    Untracked:  analysis/m_01_3_parse_dates.Rmd

Unstaged changes:
    Modified:   R/functions.R
    Modified:   _targets.R
    Modified:   analysis/index.Rmd
    Deleted:    analysis/m_01_1_get_raw_entity_data.Rmd
    Deleted:    analysis/m_01_2_parse_dates.Rmd
    Modified:   renv.lock

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.

# Set up the project environment, because {workflowr} knits each Rmd file 
# in a new R session, and doesn't execute the project .Rprofile

library(targets) # access data from the targets cache

library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
library(fs) # file system operations
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
library(stringr) # string matching

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)

1 Introduction

The aim of this set of meta notebooks is to work out how to read the raw entity data. and get it sufficiently neatened so that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.

This notebook documents the process of excluding data rows that are not useful for this project.

The subsequent notebooks in this set will check that all the columns have been read correctly and work out how to fix them, if necessary.

1.1 Exclusions

There are four variables dealing with voter status Preliminary examination of these variables shows that some records correspond to people who have been removed from the electoral roll. This project focuses on ambiguity arising from the fact that some names are common Therefore, we want the entity data to be as accurate as possible and free of duplicate records so that we don’t introduce ambiguity because of data quality issues.

Speaking of data quality issues, it is my experience that large databases often contain test records.

I will exclude all the data rows that have any of these data quality issues. I will do this early in the pipeline to minimise the number of records processed and to avoid including these records in the subsequent quality analyses.

2 Read entity data

Read the raw entity data file using the previously defined function, raw_entity_data_read().

# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
d <- raw_entity_data_read(f_entity_raw_tsv)

3 Voter status variables

Check the internal consistency of the voter status variables.

The data dictionary describes these as two pairs of variables, where each pair consists of a code variable and a label variable. I would expect the code and label values to be in a 1:1 relationship.

3.1 Status

d %>% 
  dplyr::distinct(status_cd, voter_status_desc) %>% 
  dplyr::arrange(status_cd, voter_status_desc) %>% 
  knitr::kable()
status_cd voter_status_desc
A ACTIVE
D DENIED
I INACTIVE
R REMOVED
S TEMPORARY REGISTRATION
NA NA
  • The code and label values are in a 1:1 relationship. Use the label variable (voter_status_desc) because it is more meaningful.

Look at the distribution of values across the data.

table(Label = d$voter_status_desc, useNA = "always") %>% knitr::kable()
Label Freq
ACTIVE 4914521
DENIED 41348
INACTIVE 495603
REMOVED 2546485
TEMPORARY REGISTRATION 5334
NA 2
  • ~60% of records are “ACTIVE”. All the other values have negative connotations.

3.2 Reason

d %>% 
  dplyr::distinct(reason_cd, voter_status_reason_desc) %>% 
  dplyr::arrange(reason_cd, voter_status_reason_desc) %>% 
  knitr::kable()
reason_cd voter_status_reason_desc
A1 UNVERIFIED
A2 CONFIRMATION PENDING
AA ARMED FORCES
AL LEGACY DATA
AN UNVERIFIED NEW
AP VERIFICATION PENDING
AV VERIFIED
DI UNAVAILABLE ESSENTIAL INFORMATION
DU VERIFICATION RETURNED UNDELIVERABLE
IL LEGACY - CONVERSION
IN CONFIRMATION NOT RETURNED
IU CONFIRMATION RETURNED UNDELIVERABLE
R2 DUPLICATE
RA ADMINISTRATIVE
RC REMOVED DUE TO SUSTAINED CHALLENGE
RD DECEASED
RF FELONY CONVICTION
RL MOVED FROM COUNTY
RM REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS
RP REMOVED UNDER OLD PURGE LAW
RQ REQUEST FROM VOTER
RS MOVED FROM STATE
RT TEMPORARY REGISTRANT
SM MILITARY
SO OVERSEAS CITIZEN
SP PREVIOUSLY REGISTERED
NA NA
  • The code and label values are in a 1:1 relationship. Use the label variable (voter_status_reason_desc) because it is more meaningful.

Look at the distribution of values across the data.

table(Label = d$voter_status_reason_desc, useNA = "always") %>% knitr::kable()
Label Freq
ADMINISTRATIVE 59008
ARMED FORCES 50
CONFIRMATION NOT RETURNED 181320
CONFIRMATION PENDING 71296
CONFIRMATION RETURNED UNDELIVERABLE 303197
DECEASED 443486
DUPLICATE 78951
FELONY CONVICTION 63501
LEGACY - CONVERSION 10585
LEGACY DATA 523899
MILITARY 3975
MOVED FROM COUNTY 888056
MOVED FROM STATE 89049
OVERSEAS CITIZEN 1307
PREVIOUSLY REGISTERED 51
REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS 551073
REMOVED DUE TO SUSTAINED CHALLENGE 662
REMOVED UNDER OLD PURGE LAW 367511
REQUEST FROM VOTER 4194
TEMPORARY REGISTRANT 729
UNAVAILABLE ESSENTIAL INFORMATION 6991
UNVERIFIED 13737
UNVERIFIED NEW 7517
VERIFICATION PENDING 198333
VERIFICATION RETURNED UNDELIVERABLE 34357
VERIFIED 4100220
NA 238
  • ~50% of records are “VERIFIED”. That value seems to have the most positive connotations.

3.3 Status \(\times\) Reason

Look at the combinations of the two variables.

d %>% 
  with(table(voter_status_reason_desc, voter_status_desc, useNA = "always")) %>% 
  knitr::kable()
ACTIVE DENIED INACTIVE REMOVED TEMPORARY REGISTRATION NA
ADMINISTRATIVE 0 0 0 59008 0 0
ARMED FORCES 50 0 0 0 0 0
CONFIRMATION NOT RETURNED 0 0 181320 0 0 0
CONFIRMATION PENDING 71295 0 0 1 0 0
CONFIRMATION RETURNED UNDELIVERABLE 0 0 303197 0 0 0
DECEASED 0 0 0 443486 0 0
DUPLICATE 0 0 0 78951 0 0
FELONY CONVICTION 0 0 0 63501 0 0
LEGACY - CONVERSION 1 0 10584 0 0 0
LEGACY DATA 523897 0 2 0 0 0
MILITARY 0 0 0 0 3975 0
MOVED FROM COUNTY 0 0 0 888055 0 1
MOVED FROM STATE 0 0 0 89049 0 0
OVERSEAS CITIZEN 0 0 0 0 1307 0
PREVIOUSLY REGISTERED 0 0 0 1 50 0
REMOVED AFTER 2 FED GENERAL ELECTIONS IN INACTIVE STATUS 0 0 0 551072 0 1
REMOVED DUE TO SUSTAINED CHALLENGE 0 0 0 662 0 0
REMOVED UNDER OLD PURGE LAW 0 0 0 367511 0 0
REQUEST FROM VOTER 0 0 0 4194 0 0
TEMPORARY REGISTRANT 0 0 0 729 0 0
UNAVAILABLE ESSENTIAL INFORMATION 0 6990 0 1 0 0
UNVERIFIED 13731 0 0 4 2 0
UNVERIFIED NEW 7516 0 0 1 0 0
VERIFICATION PENDING 198331 0 1 1 0 0
VERIFICATION RETURNED UNDELIVERABLE 0 34357 0 0 0 0
VERIFIED 4099700 1 499 20 0 0
NA 0 0 0 238 0 0
  • ~50% are “ACTIVE” and “VERIFIED”

On a common-sense interpretation of the labels, these are the records that have survived the registration checking process, so are most likely free of errors and duplicates.

Write a function to filter the records to keep only those that are “ACTIVE” and “VERIFIED”.

# Function to exclude records based on voter status
raw_entity_data_excl_status <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::filter(
      voter_status_desc == "ACTIVE" & voter_status_reason_desc == "VERIFIED"
    )
}

Apply the filter before moving on to the next exclusion condition and track the number of rows before and after the filter.

# number of rows before filtering
nrow(d)
[1] 8003293
d <- d %>% raw_entity_data_excl_status()

# number of rows before filtering
nrow(d)
[1] 4099700

4 Test cases

Look for any records that contain the “test” as a word (i.e. enclosed by word boundaries).

d %>% 
  dplyr::select(ends_with("_name"), res_city_desc, area_cd, phone_num) %>% 
  dplyr::filter(
    stringr::str_detect(last_name, regex("\\btest\\b", ignore_case = TRUE)) |
    stringr::str_detect(first_name, regex("\\btest\\b", ignore_case = TRUE)) |
    stringr::str_detect(midl_name, regex("\\btest\\b", ignore_case = TRUE)) |
    stringr::str_detect(street_name, regex("\\btest\\b", ignore_case = TRUE))
  ) %>% 
  knitr::kable()
last_name first_name midl_name street_name res_city_desc area_cd phone_num
TEAGUE NORMA JEAN TEST FARM WAYNESVILLE NA NA
TEST KAY ANN ELM WELDON NA NA
TEST FREDERICK HAROLD HENDERSONVILLE ASHEVILLE NA NA
PHILLIPS DEBORAH MOORE TEST FARM WALLACE 910 2859781
PHILLIPS MICHAEL DOWNING TEST FARM WALLACE 910 2859781
SOLES SHANNON ACACIA TEST FARM WALLACE 910 2852575
TEST DAVID W GREENWAY CHARLOTTE NA NA
TEST DANIEL W HICKORY KNOLL GREENSBORO NA NA
TEST GEORGE A ROSWELL CHARLOTTE NA NA
TEST DORIS L ROSWELL CHARLOTTE NA NA
TEST KATHERINE COOKE GREENWAY CHARLOTTE NA NA
TEST KENNETH FARREL WAVERLY CLAYTON NA NA
TEST THIS NA HIGH POINT JAMESTOWN NA NA
  • Online search finds that “Test” is a known last name in the USA.
  • Online search finds that “Test Farm Road” is a known address in North Carolina.
  • The name “TEST, This” looks like a test case (with ACTIVE VERIFIED status).

Write a function to filter the records to exclude those with name “TEST, This”.

There are likely to be many other test cases not detected by this filter. However, they are likely to be only a tiny fraction of the records - so it’s not a big problem if they are missed. If I find any other test cases later I will come back here and revise the exclusion criteria.

# Function to exclude test records based on names
raw_entity_data_excl_test <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::filter(
      ! (
        stringr::str_detect(last_name, regex("\\bTEST\\b", ignore_case = TRUE)) &
          stringr::str_detect(first_name, regex("\\bTHIS\\b", ignore_case = TRUE))
      )
    )
}

Apply the filter and track the number of rows before and after the filter.

# number of rows before filtering
nrow(d)
[1] 4099700
d <- d %>% raw_entity_data_excl_test()

# number of rows before filtering
nrow(d)
[1] 4099699

Timing

Computation time (excl. render): 65.754 sec elapsed

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] stringr_1.4.0   tibble_3.1.0    vroom_1.4.0     fs_1.5.0       
[5] tictoc_1.0      here_1.0.1      workflowr_1.6.2 targets_0.3.0  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.22         bslib_0.2.4       purrr_0.3.4      
 [5] vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 yaml_2.2.1       
 [9] utf8_1.2.1        rlang_0.4.10      later_1.1.0.1     pillar_1.5.1     
[13] jquerylib_0.1.3   DBI_1.1.1         glue_1.4.2        withr_2.4.1      
[17] bit64_4.0.5       lifecycle_1.0.0   codetools_0.2-18  evaluate_0.14    
[21] knitr_1.31        callr_3.5.1       httpuv_1.5.5      ps_1.6.0         
[25] parallel_4.0.3    fansi_0.4.2       highr_0.8         Rcpp_1.0.6       
[29] renv_0.13.1       promises_1.2.0.1  jsonlite_1.7.2    bit_4.0.4        
[33] digest_0.6.27     stringi_1.5.3     bookdown_0.21     processx_3.5.0   
[37] dplyr_1.0.5       rprojroot_2.0.2   cli_2.3.1         tools_4.0.3      
[41] magrittr_2.0.1    sass_0.3.1        crayon_1.4.1      pkgconfig_2.0.3  
[45] ellipsis_0.3.1    data.table_1.14.0 assertthat_0.2.1  rmarkdown_2.7    
[49] R6_2.5.0          igraph_1.2.6      compiler_4.0.3    git2r_0.28.0