Last updated: 2021-03-30
Checks: 6 passed, 1 warning
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git.
To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 3d4d205. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: _targets/
Ignored: data/VR_20051125.txt.xz
Ignored: output/blk_char.fst
Ignored: output/ent_blk.fst
Ignored: output/ent_cln.fst
Ignored: output/ent_raw.fst
Ignored: renv/library/
Ignored: renv/local/
Ignored: renv/staging/
Untracked files:
Untracked: analysis/m_01_3_drop_novar.Rmd
Untracked: analysis/m_01_4_parse_dates.Rmd
Untracked: analysis/m_01_5_check_admin.Rmd
Unstaged changes:
Modified: R/functions.R
Modified: _packages.R
Modified: _targets.R
Modified: analysis/index.Rmd
Deleted: analysis/m_01_3_parse_dates.Rmd
Modified: renv.lock
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.
# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.
# Set up the project environment, because {workflowr} knits each Rmd file
# in a new R session, and doesn't execute the project .Rprofile
library(targets) # access data from the targets cache
library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
library(fs) # file system operations
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
library(stringr) # string matching
library(skimr) # compact summary of each variable
library(lubridate) # date parsing
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(forcats) # manipulation of factors
library(ggplot2) # graphics
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)
The aim of this set of meta notebooks is to work out how to read the raw entity data and get it sufficiently neatened that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.
This notebook documents the checking of the “administrative” variables for any issues that need fixing. The subsequent notebooks in this set will check the other variables in the same way.
Regardless of whether there are any issues that need to be fixed, the analyses here may inform our use of these variables in later analyses.
We don’t know any of the details on how the NCVR data is collected and processed, so our interpretations are only educated guesses. We have no intention of using the administrative variables as predictors for entity resolution. However, it’s possible that they may shed some light on data quality which might influence our choice of the records to be used for modelling.
Define the “administrative” variables:

county_id - County identification number
county_desc - County description
voter_reg_num - Voter registration number (unique by county)
registr_dt - Voter registration date
cancellation_dt - Cancellation date

vars_admin <- c("county_id", "county_desc", "voter_reg_num", "registr_dt", "cancellation_dt")
Read the raw entity data file using the previously defined functions raw_entity_data_read(), raw_entity_data_excl_status(), raw_entity_data_excl_test(), raw_entity_data_drop_novar(), and raw_entity_data_parse_dates().
# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
d <- raw_entity_data_read(f_entity_raw_tsv) %>%
raw_entity_data_excl_status() %>%
raw_entity_data_excl_test() %>%
raw_entity_data_drop_novar() %>%
raw_entity_data_parse_dates()
dim(d)
[1] 4099699 25
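Before examining the variables one at a time, a compact per-variable overview of just the administrative variables can be obtained with skimr (a sketch, using the vars_admin vector defined above):

```r
# compact summary of the administrative variables only
d %>%
  dplyr::select(dplyr::all_of(vars_admin)) %>%
  skimr::skim()
```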
county_id - County identification number
county_desc - County description

Look at county_id, a numeric code indicating a geographical area.
# number of unique values
d$county_id %>% unique() %>% length()
[1] 100
# summary of distribution of county ID interpreted as a number
d$county_id %>% as.integer() %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 32.00 53.00 52.26 76.00 100.00
# number of records per county
d$county_id %>% as.integer() %>% table(useNA = "ifany") %>% knitr::kable()
. | Freq |
---|---|
1 | 73306 |
2 | 21143 |
3 | 4918 |
4 | 13168 |
5 | 7173 |
6 | 6961 |
7 | 17568 |
8 | 10210 |
9 | 13784 |
10 | 53463 |
11 | 104267 |
12 | 35567 |
13 | 70738 |
14 | 44381 |
15 | 3705 |
16 | 29122 |
17 | 9133 |
18 | 79576 |
19 | 32989 |
20 | 12510 |
21 | 7680 |
22 | 6536 |
23 | 52398 |
24 | 20663 |
25 | 38552 |
26 | 125885 |
27 | 8088 |
28 | 21121 |
29 | 57337 |
30 | 15489 |
31 | 23944 |
32 | 138594 |
33 | 26494 |
34 | 175097 |
35 | 28263 |
36 | 65685 |
37 | 6698 |
38 | 5563 |
39 | 18650 |
40 | 6004 |
41 | 216993 |
42 | 17029 |
43 | 43012 |
44 | 26534 |
45 | 53679 |
46 | 9751 |
47 | 14111 |
48 | 1731 |
49 | 71218 |
50 | 20151 |
51 | 56868 |
52 | 6220 |
53 | 22278 |
54 | 20083 |
55 | 33012 |
56 | 20497 |
57 | 13246 |
58 | 15348 |
59 | 19581 |
60 | 410483 |
61 | 6219 |
62 | 13653 |
63 | 41542 |
64 | 45537 |
65 | 79484 |
66 | 12539 |
67 | 45043 |
68 | 59435 |
69 | 6660 |
70 | 20400 |
71 | 24078 |
72 | 4603 |
73 | 13915 |
74 | 71178 |
75 | 2774 |
76 | 68828 |
77 | 18084 |
78 | 46871 |
79 | 51491 |
80 | 53187 |
81 | 32070 |
82 | 22487 |
83 | 15589 |
84 | 33407 |
85 | 20027 |
86 | 38069 |
87 | 11069 |
88 | 14342 |
89 | 1027 |
90 | 72265 |
91 | 12357 |
92 | 367146 |
93 | 6907 |
94 | 7758 |
95 | 31233 |
96 | 55699 |
97 | 34655 |
98 | 30033 |
99 | 11410 |
100 | 12380 |
# plot the number of records per county
ggplot(d) +
geom_bar(aes(x = forcats::fct_infreq(county_id))) +
theme(panel.grid.major = element_blank(),
axis.text.x = element_text(angle = 90, hjust=1, vjust = 0.5)
)
county_desc appears to be a text label corresponding to county_id.
Check that the county descriptions are in a 1:1 relationship with the county IDs.
# number of unique values
d$county_desc %>% unique() %>% length()
[1] 100
# number of unique values of code:label combinations
paste(d$county_id, d$county_desc) %>% unique() %>% length()
[1] 100
# Is code:label a 1:1 relationship?
# Is the number of unique labels equal to the number of unique code:label combinations
(d$county_desc %>% unique() %>% length()) ==
(paste(d$county_id, d$county_desc) %>% unique() %>% length())
[1] TRUE
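An equivalent check can be written without pasting, using dplyr::n_distinct(), which counts distinct combinations when given several vectors (a sketch; same logic as the comparison above):

```r
# 1:1 iff the number of distinct labels equals
# the number of distinct code:label combinations
dplyr::n_distinct(d$county_desc) ==
  dplyr::n_distinct(d$county_id, d$county_desc)
```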
county_desc in 1:1 relationship with county_id
They look reasonable, to the extent that I can tell without knowing anything about the counties.
voter_reg_num - Voter registration number (unique by county)
# some from the beginning of the file
d$voter_reg_num %>% head()
[1] "000000000001" "000000000001" "000000000001" "000000000001" "000000000001"
[6] "000000000001"
# some from the end of the file
d$voter_reg_num %>% tail()
[1] "000099848837" "000099848838" "000099848840" "000099848841" "000099870963"
[6] "000401437666"
# number of unique values
d$voter_reg_num %>% unique() %>% length()
[1] 1786064
# summary of distribution of voter registration number interpreted as a number
summary(as.integer(d$voter_reg_num))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 44722 223167 6670211 7629018 401437666
~1.8M unique values
Never missing
Integer 1 .. ~400M (as strings)
12-digit integers with leading zeroes
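The “12-digit integers with leading zeroes” observation can be verified directly with stringr (a sketch):

```r
# TRUE iff every registration number is exactly 12 digits
# (expected TRUE, per the head/tail samples above)
all(stringr::str_detect(d$voter_reg_num, "^[0-9]{12}$"))
```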
Check whether county_id × voter_reg_num is unique, as claimed.
# number of records
nrow(d)
[1] 4099699
# number of unique county_id x voter_reg_num combinations
paste(d$county_id, d$voter_reg_num) %>% unique() %>% length()
[1] 4099699
# Are the county_id x voter_reg_num combinations unique?
# Number of unique county_id x voter_reg_num combinations equals the number of rows?
nrow(d) ==
(paste(d$county_id, d$voter_reg_num) %>% unique() %>% length())
[1] TRUE
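The same uniqueness check can be expressed without pasting, by counting duplicated combinations directly (a sketch):

```r
# number of county_id x voter_reg_num combinations appearing more than once
# (expected 0, given the check above)
d %>%
  dplyr::count(county_id, voter_reg_num) %>%
  dplyr::filter(n > 1) %>%
  nrow()
```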
county_id × voter_reg_num is unique, as claimed

registr_dt - Voter registration date
# summary of distribution of registration date interpreted as a date
d$registr_dt %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1899-09-14" "1988-01-01" "1997-01-09" "1993-01-18" "2002-11-05" "9999-10-21"
# Get records apparently registered after the snapshot was taken
x <- d %>%
dplyr::filter(registr_dt > lubridate::ymd("2005-11-25")) # after snapshot date
# Number of records apparently registered after the snapshot was taken
nrow(x)
[1] 18
# Show records apparently registered after the snapshot was taken
x %>%
dplyr::arrange(registr_dt) %>%
dplyr::select(
registr_dt, county_desc, voter_reg_num, last_name, first_name,
street_name, street_type_cd, res_city_desc, age
) %>%
knitr::kable()
registr_dt | county_desc | voter_reg_num | last_name | first_name | street_name | street_type_cd | res_city_desc | age |
---|---|---|---|---|---|---|---|---|
2007-08-15 | SURRY | 000030004622 | ALLEN | SEAN | KENSINGTON | DR | MOUNT AIRY | 23 |
2007-10-12 | GASTON | 000007601045 | MOORE | GEORGE | UPPER SPENCER MOUNTAIN | RD | STANLEY | 56 |
2008-10-05 | GASTON | 000007600410 | HAMRICK | JIMMY | RALPHS | BLVD | GASTONIA | 35 |
2008-10-05 | GASTON | 000007600823 | MARTIN | JASON | PAMELA | ST | GASTONIA | 30 |
2008-10-11 | GASTON | 000007600617 | HUNSUCKER | JESSICA | ROLLINGWOOD | DR | STANLEY | 23 |
2011-06-11 | WILSON | 000057476091 | SMITH | FLOYD | FARMWOOD | LOOP | WILSON | 80 |
2022-09-04 | CHOWAN | 000000014190 | MEADS | LEONARD | MACEDONIA | RD | EDENTON | 34 |
2201-06-12 | MACON | 000000034702 | MCGEE | MACK | MASHBURN BRANCH | RD | FRANKLIN | 72 |
2201-09-18 | ROCKINGHAM | 000000102698 | HAIZLIP | JAMES | NC 87 | NA | EDEN | 39 |
2201-11-28 | CASWELL | 000000021711 | WHARTON | REGINA | 7TH | ST | YANCEYVILLE | 25 |
2801-11-01 | ORANGE | 000000196807 | GAUDIO | LAUREN | HILLSBOROUGH | ST | CHAPEL HILL | 24 |
3001-09-25 | WILSON | 000057476878 | MCGLAUGHON | REBECCA | BLOOMERY | RD | WILSON | 26 |
3663-06-25 | WILSON | 000057476124 | RENFROW | TERRI | OLD RALEIGH | RD | WILSON | 24 |
5113-08-07 | NASH | 000000068243 | SCHULTE | MATTHEW | SUNSET | AVE | ROCKY MOUNT | 22 |
7614-03-05 | PAMLICO | 000006450688 | MILLER | ANITA | FLORENCE | ST | NEW BERN | 22 |
8480-10-09 | CALDWELL | 000014470774 | PHILLIPS | STEVEN | BENFIELD | DR | HUDSON | 52 |
9482-03-11 | SWAIN | 000000001209 | DEHART | DEBORAH | SHEPHERDS CREEK | RD | BRYSON CITY | 42 |
9999-10-21 | ALAMANCE | 000009066908 | WOOTEN | HEATHER | CAROLINA | CIR | GRAHAM | 24 |
Never missing
18 records have registration date after the snapshot date
Some records have early registration dates
Investigate the early registration dates.
First form a view on how early is too early by finding the maximum age and assuming registration at 21 years of age.
# summary of distribution of age interpreted as an integer
d$age %>% as.integer() %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 33.00 45.00 46.93 58.00 221.00
# get some extreme quantiles of the age distribution
d$age %>% as.integer() %>% quantile(probs = c(0.003, 0.004, 0.995, 0.996, 0.997, 0.998, 0.999))
0.3% 0.4% 99.5% 99.6% 99.7% 99.8% 99.9%
0 18 98 105 105 105 204
# plot the distribution of age
d %>%
dplyr::mutate(age = as.integer(age)) %>%
dplyr::filter(age >= 80) %>%
ggplot() +
geom_vline(xintercept = c(105, 125, 204), colour = "red") +
geom_histogram(aes(x = age), binwidth = 1) +
scale_y_log10()
Warning: Transformation introduced infinite values in continuous y-axis
Warning: Removed 79 rows containing missing values (geom_bar).
That opened a can of worms. There are obviously some issues with age. I will deal with that in detail in a later notebook. Without considering age in detail, it appears that the maximum accurate age is not more than 120 years.
Assume that the maximum possible voter age is 116 years. The minimum registration age in North Carolina is 16 years (although I have no idea what it was 100 years ago). Therefore, assume that the oldest possible voter could have registered 100 years prior to the snapshot date. That is, regard any registration earlier than 1905-11-25 as very unlikely to be correct.
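Under that assumption, the suspiciously early registrations can be counted directly (a sketch; the 1905-11-25 cutoff mirrors the red reference line in the histogram that follows):

```r
# registrations earlier than the "very unlikely to be correct" cutoff
d %>%
  dplyr::filter(registr_dt < lubridate::ymd("1905-11-25")) %>%
  nrow()
```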
Now look at the distribution of registration dates that are no later than the snapshot date.
d %>%
dplyr::filter(registr_dt <= lubridate::ymd("2005-11-25")) %>%
ggplot() +
geom_vline(xintercept = c(lubridate::ymd("1905-11-25"), lubridate::ymd("1935-11-25")),
colour = "red") +
geom_histogram(aes(x = registr_dt), binwidth = 365.25) + # 1yr bins
scale_y_sqrt()
Look at the relationship between age and registration date. The vast majority of these records will be OK, so may make it easier to spot anomalous regions.
First look at all the records (excluding those with registration date after the snapshot date).
d %>%
dplyr::mutate(age = as.integer(age)) %>%
dplyr::filter(registr_dt <= lubridate::ymd("2005-11-25")) %>%
ggplot() +
geom_hex(aes(x = age, y = registr_dt, fill = stat(log10(count))), binwidth = c(1, 365.25)) # 1yr bins x&y
The heavily populated triangular region contains most of the cases and shows the (mostly) plausible combinations of registration date and age at snapshot date.
Now exclude the manifestly unlikely ages (< 18 or > 104 years).
d %>%
dplyr::mutate(age = as.integer(age)) %>%
dplyr::filter(
dplyr::between(registr_dt, lubridate::ymd("1901-01-01"), lubridate::ymd("2005-11-25")),
dplyr::between(age, 18, 104)
) %>%
ggplot() +
geom_hex(aes(x = age, y = registr_dt, fill = stat(log10(count))), binwidth = c(1, 365.25)) # 1yr bins x&y
The blue’ish upper triangle corresponds to people who were at least 18 years old at registration.
The black fringe below the blue-ish upper triangle corresponds to people who were less than 18 years old at registration.
The negative diagonal line corresponds to people who would have been zero years old at registration.
The points below the negative diagonal line correspond to people who appear to have been registered before they were born.
Most registration dates are consistent with age
A significant fraction of registration dates are inconsistent with age.
There appear to be a nontrivial number of age and registration date combinations that are implausible. These are most likely due to typos in those variables.
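One way to flag the “registered before birth” cases is to approximate each voter’s birth date from their age at the snapshot and compare it with the registration date (a sketch; age is only known in whole years, so the birth date is conservative by up to a year):

```r
snapshot <- lubridate::ymd("2005-11-25")
d %>%
  dplyr::mutate(
    age = as.integer(age),
    # earliest possible birth date consistent with the recorded age
    approx_birth = snapshot - lubridate::years(age + 1)
  ) %>%
  dplyr::filter(registr_dt < approx_birth) %>%  # registered before born
  nrow()
```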
cancellation_dt - Cancellation date
# summary of distribution of cancellation date interpreted as a date
d$cancellation_dt %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1994-10-18" "1996-12-30" "1997-01-16" "1996-12-22" "1997-01-27" "2004-10-05"
NA's
"4095558"
# look at the fraction of missing values
table(missing = is.na(d$cancellation_dt))
missing
FALSE TRUE
4141 4095558
table(missing = is.na(d$cancellation_dt)) %>% prop.table() %>% round(3)
missing
FALSE TRUE
0.001 0.999
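The missing fraction can also be computed in a single step (a sketch):

```r
# proportion of records with a missing cancellation date
# (~0.999, matching the table above)
mean(is.na(d$cancellation_dt))
```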
# plot the distribution of nonmissing cancellation date
d %>%
dplyr::filter(!is.na(cancellation_dt)) %>% # not missing
ggplot() +
geom_histogram(aes(x = cancellation_dt), binwidth = 7) + # 1wk bins
scale_y_sqrt()
Almost always missing
Concentrated in 1996 and early 1997 (presumably some sort of administrative purge)
It is not clear what having a cancellation date means for records that are flagged as ACTIVE & VERIFIED. Perhaps they had been removed from the electoral roll in the past and subsequently reinstated.
Given the high proportion of missing values, there is no point keeping cancellation_dt. Write a function to drop it.
# Function to drop cancellation_dt
raw_entity_data_drop_cancel_dt <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::select(-cancellation_dt)
}
Apply the function and check the number of columns before and after dropping.
# number of columns before dropping
d %>%
names() %>% length
[1] 25
d %>%
raw_entity_data_drop_cancel_dt() %>%
# number of columns after dropping
names() %>% length
[1] 24
Computation time (excl. render): 256.585 sec elapsed
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ggplot2_3.3.3 forcats_0.5.1 lubridate_1.7.10 skimr_2.1.3
[5] stringr_1.4.0 tibble_3.1.0 vroom_1.4.0 fs_1.5.0
[9] tictoc_1.0 here_1.0.1 workflowr_1.6.2 targets_0.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41 ps_1.6.0 assertthat_0.2.1
[5] rprojroot_2.0.2 digest_0.6.27 utf8_1.2.1 R6_2.5.0
[9] repr_1.1.3 evaluate_0.14 highr_0.8 pillar_1.5.1
[13] rlang_0.4.10 data.table_1.14.0 hexbin_1.28.2 callr_3.6.0
[17] jquerylib_0.1.3 rmarkdown_2.7 labeling_0.4.2 igraph_1.2.6
[21] bit_4.0.4 munsell_0.5.0 compiler_4.0.3 httpuv_1.5.5
[25] xfun_0.22 pkgconfig_2.0.3 base64enc_0.1-3 htmltools_0.5.1.1
[29] tidyselect_1.1.0 bookdown_0.21 codetools_0.2-18 fansi_0.4.2
[33] crayon_1.4.1 dplyr_1.0.5 withr_2.4.1 later_1.1.0.1
[37] grid_4.0.3 jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.0
[41] DBI_1.1.1 git2r_0.28.0 magrittr_2.0.1 scales_1.1.1
[45] cli_2.3.1 stringi_1.5.3 farver_2.1.0 renv_0.13.1
[49] promises_1.2.0.1 bslib_0.2.4 ellipsis_0.3.1 generics_0.1.0
[53] vctrs_0.3.7 tools_4.0.3 bit64_4.0.5 glue_1.4.2
[57] purrr_0.3.4 processx_3.5.0 parallel_4.0.3 yaml_2.2.1
[61] colorspace_2.0-0 knitr_1.31 sass_0.3.1