  • 1 Introduction
  • 2 Read entity data
  • 3 county_id & county_desc
  • 4 voter_reg_num
  • 5 registr_dt
  • 6 cancellation_dt
  • Timing

Last updated: 2021-03-30

Checks: 6 passed, 1 warning

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 3d4d205. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    _targets/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/blk_char.fst
    Ignored:    output/ent_blk.fst
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/local/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  analysis/m_01_3_drop_novar.Rmd
    Untracked:  analysis/m_01_4_parse_dates.Rmd
    Untracked:  analysis/m_01_5_check_admin.Rmd

Unstaged changes:
    Modified:   R/functions.R
    Modified:   _packages.R
    Modified:   _targets.R
    Modified:   analysis/index.Rmd
    Deleted:    analysis/m_01_3_parse_dates.Rmd
    Modified:   renv.lock

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.

# Set up the project environment, because {workflowr} knits each Rmd file 
# in a new R session, and doesn't execute the project .Rprofile

library(targets) # access data from the targets cache

library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
library(fs) # file system operations
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
library(stringr) # string matching
library(skimr) # compact summary of each variable
library(lubridate) # date parsing

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(forcats) # manipulation of factors
library(ggplot2) # graphics

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)

1 Introduction

The aim of this set of meta notebooks is to work out how to read the raw entity data and get it sufficiently neatened that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.

This notebook documents checking the “administrative” variables for any issues that need fixing. The subsequent notebooks in this set will check the other variables in the same way.

Regardless of whether there are any issues that need to be fixed, the analyses here may inform our use of these variables in later analyses.

We don’t know any of the details on how the NCVR data is collected and processed, so our interpretations are only educated guesses. We have no intention of using the administrative variables as predictors for entity resolution. However, it’s possible that they may shed some light on data quality which might influence our choice of the records to be used for modelling.

Define the “administrative” variables:

  • county_id - County identification number
  • county_desc - County description
  • voter_reg_num - Voter registration number (unique by county)
  • registr_dt - Voter registration date
  • cancellation_dt - Cancellation date
vars_admin <- c("county_id", "county_desc", "voter_reg_num", "registr_dt", "cancellation_dt")  

2 Read entity data

Read the raw entity data file using the previously defined functions raw_entity_data_read(), raw_entity_data_excl_status(), raw_entity_data_excl_test(), raw_entity_data_drop_novar(), and raw_entity_data_parse_dates().

# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
d <- raw_entity_data_read(f_entity_raw_tsv) %>% 
  raw_entity_data_excl_status() %>% 
  raw_entity_data_excl_test() %>% 
  raw_entity_data_drop_novar() %>% 
  raw_entity_data_parse_dates()

dim(d)
[1] 4099699      25

3 county_id & county_desc

county_id - County identification number
county_desc - County description

Look at county_id, a numeric code indicating a geographical area.

# number of unique values
d$county_id %>% unique() %>% length()
[1] 100
# summary of distribution of county ID interpreted as a number
d$county_id %>% as.integer() %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   32.00   53.00   52.26   76.00  100.00 
# number of records per county
d$county_id %>% as.integer() %>% table(useNA = "ifany") %>% knitr::kable()
county_id Freq
1 73306
2 21143
3 4918
4 13168
5 7173
6 6961
7 17568
8 10210
9 13784
10 53463
11 104267
12 35567
13 70738
14 44381
15 3705
16 29122
17 9133
18 79576
19 32989
20 12510
21 7680
22 6536
23 52398
24 20663
25 38552
26 125885
27 8088
28 21121
29 57337
30 15489
31 23944
32 138594
33 26494
34 175097
35 28263
36 65685
37 6698
38 5563
39 18650
40 6004
41 216993
42 17029
43 43012
44 26534
45 53679
46 9751
47 14111
48 1731
49 71218
50 20151
51 56868
52 6220
53 22278
54 20083
55 33012
56 20497
57 13246
58 15348
59 19581
60 410483
61 6219
62 13653
63 41542
64 45537
65 79484
66 12539
67 45043
68 59435
69 6660
70 20400
71 24078
72 4603
73 13915
74 71178
75 2774
76 68828
77 18084
78 46871
79 51491
80 53187
81 32070
82 22487
83 15589
84 33407
85 20027
86 38069
87 11069
88 14342
89 1027
90 72265
91 12357
92 367146
93 6907
94 7758
95 31233
96 55699
97 34655
98 30033
99 11410
100 12380
# plot the number of records per county
ggplot(d) +
  geom_bar(aes(x = forcats::fct_infreq(county_id))) +
  theme(panel.grid.major = element_blank(), 
        axis.text.x = element_text(angle = 90, hjust=1, vjust = 0.5)
  )

  • Never missing (spot-checked in the sketch below)
  • Integer 1 .. 100 (as strings)
  • A small number of populous counties with a long tail of small counties
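
A quick spot-check of the first two observations (a sketch, not part of the original pipeline; it assumes the data frame d built above):

# Spot-check: county_id is never missing and every value is one of the
# strings "1" .. "100" (possibly zero-padded)
stopifnot(
  !anyNA(d$county_id),
  all(as.integer(d$county_id) %in% 1:100)
)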

county_desc appears to be a text label corresponding to county_id. Check that the county descriptions are in a 1:1 relationship with the county IDs (a more direct version of this check is sketched at the end of this section).

# number of unique values
d$county_desc %>% unique() %>% length()
[1] 100
# number of unique values of code:label combinations
paste(d$county_id, d$county_desc) %>% unique() %>% length()
[1] 100
# Is code:label a 1:1 relationship?
# Is the number of unique labels equal to the number of unique code:label combinations
(d$county_desc %>% unique() %>% length()) ==
  (paste(d$county_id, d$county_desc) %>% unique() %>% length())
[1] TRUE
  • 100 unique values
  • county_desc in 1:1 relationship with county_id

They look reasonable, to the extent that I can tell without knowing anything about the counties.
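
As an aside, a more direct version of the 1:1 check would count the distinct labels per ID and the distinct IDs per label. This is a sketch only, not part of the pipeline (dplyr is available because the pipeline functions use it):

# Sketch: each county_id should map to exactly one county_desc, and vice versa
county_pairs <- dplyr::distinct(d, county_id, county_desc)
all(dplyr::count(county_pairs, county_id)$n == 1)   # one description per ID?
all(dplyr::count(county_pairs, county_desc)$n == 1) # one ID per description?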

4 voter_reg_num

voter_reg_num - Voter registration number (unique by county)

# some from the beginning of the file
d$voter_reg_num %>% head()
[1] "000000000001" "000000000001" "000000000001" "000000000001" "000000000001"
[6] "000000000001"
# some from the end of the file
d$voter_reg_num %>% tail()
[1] "000099848837" "000099848838" "000099848840" "000099848841" "000099870963"
[6] "000401437666"
# number of unique values
d$voter_reg_num %>% unique() %>% length()
[1] 1786064
# summary of distribution of voter registration number interpreted as a number
summary(as.integer(d$voter_reg_num))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
        1     44722    223167   6670211   7629018 401437666 
  • ~1.8M unique values
    • Far fewer than the number of rows, so registration numbers are reused across counties
  • Never missing
  • Integer 1 .. ~400M (as strings)
  • 12-digit strings with leading zeroes (spot-checked in the sketch below)
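
A spot-check of the fixed-width format (a sketch, not part of the original pipeline):

# Spot-check: every voter_reg_num is exactly 12 digit characters
all(grepl("^[0-9]{12}$", d$voter_reg_num))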

Check whether county_id × voter_reg_num is unique, as claimed.

# number of records
nrow(d)
[1] 4099699
# number of unique county_id x voter_reg_num combinations
paste(d$county_id, d$voter_reg_num) %>% unique() %>% length()
[1] 4099699
# Are the county_id x voter_reg_num combinations unique?
# Number of unique county_id x voter_reg_num combinations equals the number of rows?
nrow(d) ==
  (paste(d$county_id, d$voter_reg_num) %>% unique() %>% length())
[1] TRUE
  • county_id × voter_reg_num is unique, as claimed

5 registr_dt

registr_dt - Voter registration date

# summary of distribution of registration date interpreted as a date
d$registr_dt %>% summary()
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"1899-09-14" "1988-01-01" "1997-01-09" "1993-01-18" "2002-11-05" "9999-10-21" 
# Get records apparently registered after the snapshot was taken
x <- d %>% 
  dplyr::filter(registr_dt > lubridate::ymd("2005-11-25")) # after snapshot date

# Number of records apparently registered after the snapshot was taken
nrow(x)
[1] 18
# Show records apparently registered after the snapshot was taken
x %>% 
  dplyr::arrange(registr_dt) %>% 
  dplyr::select(
    registr_dt, county_desc, voter_reg_num, last_name, first_name, 
    street_name, street_type_cd, res_city_desc, age
    ) %>% 
  knitr::kable()
registr_dt county_desc voter_reg_num last_name first_name street_name street_type_cd res_city_desc age
2007-08-15 SURRY 000030004622 ALLEN SEAN KENSINGTON DR MOUNT AIRY 23
2007-10-12 GASTON 000007601045 MOORE GEORGE UPPER SPENCER MOUNTAIN RD STANLEY 56
2008-10-05 GASTON 000007600410 HAMRICK JIMMY RALPHS BLVD GASTONIA 35
2008-10-05 GASTON 000007600823 MARTIN JASON PAMELA ST GASTONIA 30
2008-10-11 GASTON 000007600617 HUNSUCKER JESSICA ROLLINGWOOD DR STANLEY 23
2011-06-11 WILSON 000057476091 SMITH FLOYD FARMWOOD LOOP WILSON 80
2022-09-04 CHOWAN 000000014190 MEADS LEONARD MACEDONIA RD EDENTON 34
2201-06-12 MACON 000000034702 MCGEE MACK MASHBURN BRANCH RD FRANKLIN 72
2201-09-18 ROCKINGHAM 000000102698 HAIZLIP JAMES NC 87 NA EDEN 39
2201-11-28 CASWELL 000000021711 WHARTON REGINA 7TH ST YANCEYVILLE 25
2801-11-01 ORANGE 000000196807 GAUDIO LAUREN HILLSBOROUGH ST CHAPEL HILL 24
3001-09-25 WILSON 000057476878 MCGLAUGHON REBECCA BLOOMERY RD WILSON 26
3663-06-25 WILSON 000057476124 RENFROW TERRI OLD RALEIGH RD WILSON 24
5113-08-07 NASH 000000068243 SCHULTE MATTHEW SUNSET AVE ROCKY MOUNT 22
7614-03-05 PAMLICO 000006450688 MILLER ANITA FLORENCE ST NEW BERN 22
8480-10-09 CALDWELL 000014470774 PHILLIPS STEVEN BENFIELD DR HUDSON 52
9482-03-11 SWAIN 000000001209 DEHART DEBORAH SHEPHERDS CREEK RD BRYSON CITY 42
9999-10-21 ALAMANCE 000009066908 WOOTEN HEATHER CAROLINA CIR GRAHAM 24
  • Never missing

  • 18 records have registration date after the snapshot date

    • Range from a couple of years to millennia in the future
    • Presumably these are typos
  • Some records have early registration dates

Investigate the early registration dates.

First form a view on how early is too early by finding the maximum age and assuming registration at 21 years of age.

# summary of distribution of age interpreted as an integer
d$age %>% as.integer() %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   33.00   45.00   46.93   58.00  221.00 
# get some extreme quantiles of the age distribution
d$age %>% as.integer() %>% quantile(probs = c(0.003, 0.004, 0.995, 0.996, 0.997, 0.998, 0.999))
 0.3%  0.4% 99.5% 99.6% 99.7% 99.8% 99.9% 
    0    18    98   105   105   105   204 
# plot the distribution of age
d %>% 
  dplyr::mutate(age = as.integer(age)) %>% 
  dplyr::filter(age >= 80) %>% 
  ggplot() +
  geom_vline(xintercept = c(105, 125, 204), colour = "red") +
  geom_histogram(aes(x = age), binwidth = 1) +
  scale_y_log10()
Warning: Transformation introduced infinite values in continuous y-axis
Warning: Removed 79 rows containing missing values (geom_bar).

That opened a can of worms. There are obviously some issues with age. I will deal with that in detail in a later notebook.

Without considering age in detail, it appears that the maximum accurate age is not more than 120 years.

Assume that the maximum possible voter age is 116 years. The minimum registration age in North Carolina is 16 years (although I have no idea what it was 100 years ago). Therefore, assume that the oldest possible voter could have registered 100 years prior to the snapshot date. That is, regard any registration earlier than 1905-11-25 as very unlikely to be correct.
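
The date arithmetic behind that cut-off, as a small sketch:

# Earliest plausible registration date:
# snapshot date minus (max plausible age 116 - min registration age 16) = 100 years
lubridate::ymd("2005-11-25") - lubridate::years(100)  # expected: "1905-11-25"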

Now look at the distribution of registration dates that are no later than the snapshot date.

d %>% 
  dplyr::filter(registr_dt <= lubridate::ymd("2005-11-25")) %>% 
  ggplot() +
  geom_vline(xintercept = c(lubridate::ymd("1905-11-25"), lubridate::ymd("1935-11-25")),
             colour = "red") +
  geom_histogram(aes(x = registr_dt), binwidth = 365.25) + # 1yr bins
  scale_y_sqrt()

  • There is a large spike of registrations in 1900; these are bound to be errors (a quick tally is sketched below).
  • Registration dates before ~1935 are suspect, because the distribution of probably accurate dates appears to run out around then.
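
A quick tally of these suspect dates (a sketch only; the exact cut-offs are judgement calls and the counts are not shown here):

# Count registrations dated in 1900 (the apparent spike) and registrations
# before the ~1935 point where the plausible dates appear to run out
d %>% 
  dplyr::summarise(
    n_in_1900     = sum(lubridate::year(registr_dt) == 1900, na.rm = TRUE),
    n_before_1935 = sum(registr_dt < lubridate::ymd("1935-11-25"), na.rm = TRUE)
  )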

Look at the relationship between age and registration date. The vast majority of these records will be OK, so this view may make it easier to spot anomalous regions.

First look at all the records (excluding those with registration date after the snapshot date).

d %>% 
  dplyr::mutate(age = as.integer(age)) %>% 
  dplyr::filter(registr_dt <= lubridate::ymd("2005-11-25")) %>%
  ggplot() +
  geom_hex(aes(x = age, y = registr_dt, fill = stat(log10(count))), binwidth = c(1, 365.25)) # 1yr bins x&y

The heavily populated triangular region contains most of the cases and shows the (mostly) plausible combinations of registration date and age at snapshot date.

Now exclude the manifestly unlikely ages (< 18 or > 104 years).

d %>% 
  dplyr::mutate(age = as.integer(age)) %>% 
  dplyr::filter(
    dplyr::between(registr_dt, lubridate::ymd("1901-01-01"), lubridate::ymd("2005-11-25")),
    dplyr::between(age, 18, 104)
  ) %>%
  ggplot() +
  geom_hex(aes(x = age, y = registr_dt, fill = stat(log10(count))), binwidth = c(1, 365.25)) # 1yr bins x&y

  • The blue-ish upper triangle corresponds to people who were at least 18 years old at registration.
  • The black fringe below the blue-ish upper triangle corresponds to people who were less than 18 years old at registration.
  • The negative diagonal line corresponds to people who would have been zero years old at registration.
  • The points below the negative diagonal line correspond to people who appear to have been registered before they were born.
  • Most registration dates are consistent with age, but a significant fraction are not.

There appear to be a nontrivial number of age and registration date combinations that are implausible. These are most likely due to typos in those variables. A rough count is sketched after this list.

  • The implausible combinations are only a small fraction of the total records.
  • We are not intending to use age or registration date in the models, so the oddities are probably not an issue. However, they do indicate that we should not treat this data as though it is perfectly accurate.
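
This sketch (not part of the original pipeline) approximates each voter's age at registration from the snapshot age and the gap between registration and the snapshot, then counts the implausible cases:

# Approximate each voter's age at registration and count implausible cases
d %>%
  dplyr::filter(registr_dt <= lubridate::ymd("2005-11-25")) %>%
  dplyr::mutate(
    age = as.integer(age),
    yrs_before_snapshot = as.numeric(lubridate::ymd("2005-11-25") - registr_dt) / 365.25,
    approx_age_at_reg = age - yrs_before_snapshot
  ) %>%
  dplyr::summarise(
    n_records              = dplyr::n(),
    n_under_18_at_reg      = sum(approx_age_at_reg < 18, na.rm = TRUE),
    n_registered_pre_birth = sum(approx_age_at_reg < 0, na.rm = TRUE)
  )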

6 cancellation_dt

cancellation_dt - Cancellation date

# summary of distribution of cancellation date interpreted as a date
d$cancellation_dt %>% summary()
        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
"1994-10-18" "1996-12-30" "1997-01-16" "1996-12-22" "1997-01-27" "2004-10-05" 
        NA's 
   "4095558" 
# look at the fraction of missing values
table(missing = is.na(d$cancellation_dt))
missing
  FALSE    TRUE 
   4141 4095558 
table(missing = is.na(d$cancellation_dt)) %>% prop.table() %>% round(3)
missing
FALSE  TRUE 
0.001 0.999 
# plot the distribution of nonmissing cancellation date
d %>% 
  dplyr::filter(!is.na(cancellation_dt)) %>% # not missing
  ggplot() +
  geom_histogram(aes(x = cancellation_dt), binwidth = 7) + # 1wk bins
  scale_y_sqrt()

  • Almost always missing
    • Only 4,141 (0.1%) nonmissing
  • Concentrated in 1996 and early 1997 (presumably some sort of administrative purge)

It is not clear what having a cancellation date means for records that are flagged as ACTIVE & VERIFIED. Perhaps they had been removed from the electoral roll in the past and subsequently reinstated.

Given the high proportion of missing values, there is no point in keeping cancellation_dt. Write a function to drop it.

# Function to drop cancel_dt
raw_entity_data_drop_cancel_dt <- function(
  d # data frame - raw entity data
) {
  d %>%
    dplyr::select(-cancellation_dt)
}

Apply the function and check the number of columns before and after dropping the variable.

# number of columns before dropping
d %>% 
  names() %>% length
[1] 25
d %>% 
  raw_entity_data_drop_cancel_dt() %>% 
  # number of columns after dropping
  names() %>% length
[1] 24

Timing

Computation time (excl. render): 256.585 sec elapsed

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggplot2_3.3.3    forcats_0.5.1    lubridate_1.7.10 skimr_2.1.3     
 [5] stringr_1.4.0    tibble_3.1.0     vroom_1.4.0      fs_1.5.0        
 [9] tictoc_1.0       here_1.0.1       workflowr_1.6.2  targets_0.3.1   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        lattice_0.20-41   ps_1.6.0          assertthat_0.2.1 
 [5] rprojroot_2.0.2   digest_0.6.27     utf8_1.2.1        R6_2.5.0         
 [9] repr_1.1.3        evaluate_0.14     highr_0.8         pillar_1.5.1     
[13] rlang_0.4.10      data.table_1.14.0 hexbin_1.28.2     callr_3.6.0      
[17] jquerylib_0.1.3   rmarkdown_2.7     labeling_0.4.2    igraph_1.2.6     
[21] bit_4.0.4         munsell_0.5.0     compiler_4.0.3    httpuv_1.5.5     
[25] xfun_0.22         pkgconfig_2.0.3   base64enc_0.1-3   htmltools_0.5.1.1
[29] tidyselect_1.1.0  bookdown_0.21     codetools_0.2-18  fansi_0.4.2      
[33] crayon_1.4.1      dplyr_1.0.5       withr_2.4.1       later_1.1.0.1    
[37] grid_4.0.3        jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.0  
[41] DBI_1.1.1         git2r_0.28.0      magrittr_2.0.1    scales_1.1.1     
[45] cli_2.3.1         stringi_1.5.3     farver_2.1.0      renv_0.13.1      
[49] promises_1.2.0.1  bslib_0.2.4       ellipsis_0.3.1    generics_0.1.0   
[53] vctrs_0.3.7       tools_4.0.3       bit64_4.0.5       glue_1.4.2       
[57] purrr_0.3.4       processx_3.5.0    parallel_4.0.3    yaml_2.2.1       
[61] colorspace_2.0-0  knitr_1.31        sass_0.3.1