Last updated: 2021-03-30
Checks: 6 passed, 1 warning
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git.
To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 3d4d205. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: _targets/
Ignored: data/VR_20051125.txt.xz
Ignored: output/blk_char.fst
Ignored: output/ent_blk.fst
Ignored: output/ent_cln.fst
Ignored: output/ent_raw.fst
Ignored: renv/library/
Ignored: renv/local/
Ignored: renv/staging/
Untracked files:
Untracked: analysis/m_01_3_drop_novar.Rmd
Untracked: analysis/m_01_4_parse_dates.Rmd
Untracked: analysis/m_01_5_check_admin.Rmd
Unstaged changes:
Modified: R/functions.R
Modified: _packages.R
Modified: _targets.R
Modified: analysis/index.Rmd
Deleted: analysis/m_01_3_parse_dates.Rmd
Modified: renv.lock
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.
# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.
# Set up the project environment, because {workflowr} knits each Rmd file
# in a new R session, and doesn't execute the project .Rprofile
library(targets) # access data from the targets cache
library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
library(fs) # file system operations
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
library(stringr) # string matching
library(skimr) # compact summary of each variable
library(lubridate) # date parsing
Attaching package: 'lubridate'
The following objects are masked from 'package:base':
date, intersect, setdiff, union
library(forcats) # manipulation of factors
library(ggplot2) # graphics
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)
The aim of this set of meta notebooks is to work out how to read the raw entity data and get it sufficiently neatened that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.
This notebook documents the checking of the “administrative” variables for any issues that need fixing. The subsequent notebooks in this set will check the other variables in the same way.
Regardless of whether there are any issues that need to be fixed, the analyses here may inform our use of these variables in later analyses.
We don’t know any of the details on how the NCVR data is collected and processed, so our interpretations are only educated guesses. We have no intention of using the administrative variables as predictors for entity resolution. However, it’s possible that they may shed some light on data quality which might influence our choice of the records to be used for modelling.
Define the “administrative” variables:

county_id - County identification number
county_desc - County description
voter_reg_num - Voter registration number (unique by county)
registr_dt - Voter registration date
cancellation_dt - Cancellation date

vars_admin <- c("county_id", "county_desc", "voter_reg_num", "registr_dt", "cancellation_dt")
Read the raw entity data file using the previously defined functions raw_entity_data_read(), raw_entity_data_excl_status(), raw_entity_data_excl_test(), raw_entity_data_drop_novar(), and raw_entity_data_parse_dates().
# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
d <- raw_entity_data_read(f_entity_raw_tsv) %>%
raw_entity_data_excl_status() %>%
raw_entity_data_excl_test() %>%
raw_entity_data_drop_novar() %>%
raw_entity_data_parse_dates()
dim(d)
[1] 4099699 25
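Before examining the variables one at a time, a compact per-variable overview of just the administrative variables can be obtained with skimr (a sketch, using the vars_admin vector defined above):

```r
# compact summary of the administrative variables only
d %>%
  dplyr::select(dplyr::all_of(vars_admin)) %>%
  skimr::skim()
```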
county_id - County identification number
county_desc - County description

Look at county_id, a numeric code indicating a geographical area.
# number of unique values
d$county_id %>% unique() %>% length()
[1] 100
# summary of distribution of county ID interpreted as a number
d$county_id %>% as.integer() %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 32.00 53.00 52.26 76.00 100.00
# number of records per county
d$county_id %>% as.integer() %>% table(useNA = "ifany") %>% knitr::kable()
. | Freq |
---|---|
1 | 73306 |
2 | 21143 |
3 | 4918 |
4 | 13168 |
5 | 7173 |
6 | 6961 |
7 | 17568 |
8 | 10210 |
9 | 13784 |
10 | 53463 |
11 | 104267 |
12 | 35567 |
13 | 70738 |
14 | 44381 |
15 | 3705 |
16 | 29122 |
17 | 9133 |
18 | 79576 |
19 | 32989 |
20 | 12510 |
21 | 7680 |
22 | 6536 |
23 | 52398 |
24 | 20663 |
25 | 38552 |
26 | 125885 |
27 | 8088 |
28 | 21121 |
29 | 57337 |
30 | 15489 |
31 | 23944 |
32 | 138594 |
33 | 26494 |
34 | 175097 |
35 | 28263 |
36 | 65685 |
37 | 6698 |
38 | 5563 |
39 | 18650 |
40 | 6004 |
41 | 216993 |
42 | 17029 |
43 | 43012 |
44 | 26534 |
45 | 53679 |
46 | 9751 |
47 | 14111 |
48 | 1731 |
49 | 71218 |
50 | 20151 |
51 | 56868 |
52 | 6220 |
53 | 22278 |
54 | 20083 |
55 | 33012 |
56 | 20497 |
57 | 13246 |
58 | 15348 |
59 | 19581 |
60 | 410483 |
61 | 6219 |
62 | 13653 |
63 | 41542 |
64 | 45537 |
65 | 79484 |
66 | 12539 |
67 | 45043 |
68 | 59435 |
69 | 6660 |
70 | 20400 |
71 | 24078 |
72 | 4603 |
73 | 13915 |
74 | 71178 |
75 | 2774 |
76 | 68828 |
77 | 18084 |
78 | 46871 |
79 | 51491 |
80 | 53187 |
81 | 32070 |
82 | 22487 |
83 | 15589 |
84 | 33407 |
85 | 20027 |
86 | 38069 |
87 | 11069 |
88 | 14342 |
89 | 1027 |
90 | 72265 |
91 | 12357 |
92 | 367146 |
93 | 6907 |
94 | 7758 |
95 | 31233 |
96 | 55699 |
97 | 34655 |
98 | 30033 |
99 | 11410 |
100 | 12380 |
# plot the number of records per county
ggplot(d) +
geom_bar(aes(x = forcats::fct_infreq(county_id))) +
theme(panel.grid.major = element_blank(),
axis.text.x = element_text(angle = 90, hjust=1, vjust = 0.5)
)
county_desc appears to be a text label corresponding to county_id.
Check that the county descriptions are in a 1:1 relationship with the county IDs.
# number of unique values
d$county_desc %>% unique() %>% length()
[1] 100
# number of unique values of code:label combinations
paste(d$county_id, d$county_desc) %>% unique() %>% length()
[1] 100
# Is code:label a 1:1 relationship?
# Is the number of unique labels equal to the number of unique code:label combinations
(d$county_desc %>% unique() %>% length()) ==
(paste(d$county_id, d$county_desc) %>% unique() %>% length())
[1] TRUE
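An equivalent check can be written without pasting, using dplyr::n_distinct(), which counts distinct combinations when given several vectors (a sketch; same logic as the comparison above):

```r
# 1:1 iff the number of distinct labels equals
# the number of distinct code:label combinations
dplyr::n_distinct(d$county_desc) ==
  dplyr::n_distinct(d$county_id, d$county_desc)
```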
county_desc in 1:1 relationship with county_id
They look reasonable, to the extent that I can tell without knowing anything about the counties.
voter_reg_num - Voter registration number (unique by county)
# some from the beginning of the file
d$voter_reg_num %>% head()
[1] "000000000001" "000000000001" "000000000001" "000000000001" "000000000001"
[6] "000000000001"
# some from the end of the file
d$voter_reg_num %>% tail()
[1] "000099848837" "000099848838" "000099848840" "000099848841" "000099870963"
[6] "000401437666"
# number of unique values
d$voter_reg_num %>% unique() %>% length()
[1] 1786064
# summary of distribution of voter registration number interpreted as a number
summary(as.integer(d$voter_reg_num))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 44722 223167 6670211 7629018 401437666
~1.8M unique values
Never missing
Integer 1 .. ~400M (as strings)
12-digit integers with leading zeroes
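The “12-digit integers with leading zeroes” observation can be verified directly with stringr (a sketch):

```r
# TRUE iff every registration number is exactly 12 digits
# (expected TRUE, per the head/tail samples above)
all(stringr::str_detect(d$voter_reg_num, "^[0-9]{12}$"))
```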
Check whether county_id × voter_reg_num is unique, as claimed.
# number of records
nrow(d)
[1] 4099699
# number of unique county_id x voter_reg_num combinations
paste(d$county_id, d$voter_reg_num) %>% unique() %>% length()
[1] 4099699
# Are the county_id x voter_reg_num combinations unique?
# Number of unique county_id x voter_reg_num combinations equals the number of rows?
nrow(d) ==
(paste(d$county_id, d$voter_reg_num) %>% unique() %>% length())
[1] TRUE
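The same uniqueness check can be expressed without pasting, by counting duplicated combinations directly (a sketch):

```r
# number of county_id x voter_reg_num combinations appearing more than once
# (expected 0, given the check above)
d %>%
  dplyr::count(county_id, voter_reg_num) %>%
  dplyr::filter(n > 1) %>%
  nrow()
```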
county_id × voter_reg_num is unique, as claimed

registr_dt - Voter registration date
# summary of distribution of registration date interpreted as a date
d$registr_dt %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1899-09-14" "1988-01-01" "1997-01-09" "1993-01-18" "2002-11-05" "9999-10-21"
# Get records apparently registered after the snapshot was taken
x <- d %>%
dplyr::filter(registr_dt > lubridate::ymd("2005-11-25")) # after snapshot date
# Number of records apparently registered after the snapshot was taken
nrow(x)
[1] 18
# Show records apparently registered after the snapshot was taken
x %>%
dplyr::arrange(registr_dt) %>%
dplyr::select(
registr_dt, county_desc, voter_reg_num, last_name, first_name,
street_name, street_type_cd, res_city_desc, age
) %>%
knitr::kable()
registr_dt | county_desc | voter_reg_num | last_name | first_name | street_name | street_type_cd | res_city_desc | age |
---|---|---|---|---|---|---|---|---|
2007-08-15 | SURRY | 000030004622 | ALLEN | SEAN | KENSINGTON | DR | MOUNT AIRY | 23 |
2007-10-12 | GASTON | 000007601045 | MOORE | GEORGE | UPPER SPENCER MOUNTAIN | RD | STANLEY | 56 |
2008-10-05 | GASTON | 000007600410 | HAMRICK | JIMMY | RALPHS | BLVD | GASTONIA | 35 |
2008-10-05 | GASTON | 000007600823 | MARTIN | JASON | PAMELA | ST | GASTONIA | 30 |
2008-10-11 | GASTON | 000007600617 | HUNSUCKER | JESSICA | ROLLINGWOOD | DR | STANLEY | 23 |
2011-06-11 | WILSON | 000057476091 | SMITH | FLOYD | FARMWOOD | LOOP | WILSON | 80 |
2022-09-04 | CHOWAN | 000000014190 | MEADS | LEONARD | MACEDONIA | RD | EDENTON | 34 |
2201-06-12 | MACON | 000000034702 | MCGEE | MACK | MASHBURN BRANCH | RD | FRANKLIN | 72 |
2201-09-18 | ROCKINGHAM | 000000102698 | HAIZLIP | JAMES | NC 87 | NA | EDEN | 39 |
2201-11-28 | CASWELL | 000000021711 | WHARTON | REGINA | 7TH | ST | YANCEYVILLE | 25 |
2801-11-01 | ORANGE | 000000196807 | GAUDIO | LAUREN | HILLSBOROUGH | ST | CHAPEL HILL | 24 |
3001-09-25 | WILSON | 000057476878 | MCGLAUGHON | REBECCA | BLOOMERY | RD | WILSON | 26 |
3663-06-25 | WILSON | 000057476124 | RENFROW | TERRI | OLD RALEIGH | RD | WILSON | 24 |
5113-08-07 | NASH | 000000068243 | SCHULTE | MATTHEW | SUNSET | AVE | ROCKY MOUNT | 22 |
7614-03-05 | PAMLICO | 000006450688 | MILLER | ANITA | FLORENCE | ST | NEW BERN | 22 |
8480-10-09 | CALDWELL | 000014470774 | PHILLIPS | STEVEN | BENFIELD | DR | HUDSON | 52 |
9482-03-11 | SWAIN | 000000001209 | DEHART | DEBORAH | SHEPHERDS CREEK | RD | BRYSON CITY | 42 |
9999-10-21 | ALAMANCE | 000009066908 | WOOTEN | HEATHER | CAROLINA | CIR | GRAHAM | 24 |
Never missing
18 records have registration date after the snapshot date
Some records have early registration dates
Investigate the early registration dates.
First form a view on how early is too early by finding the maximum age and assuming registration at 21 years of age.
# summary of distribution of age interpreted as an integer
d$age %>% as.integer() %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 33.00 45.00 46.93 58.00 221.00
# get some extreme quantiles of the age distribution
d$age %>% as.integer() %>% quantile(probs = c(0.003, 0.004, 0.995, 0.996, 0.997, 0.998, 0.999))
0.3% 0.4% 99.5% 99.6% 99.7% 99.8% 99.9%
0 18 98 105 105 105 204
# plot the distribution of age
d %>%
dplyr::mutate(age = as.integer(age)) %>%
dplyr::filter(age >= 80) %>%
ggplot() +
geom_vline(xintercept = c(105, 125, 204), colour = "red") +
geom_histogram(aes(x = age), binwidth = 1) +
scale_y_log10()
Warning: Transformation introduced infinite values in continuous y-axis
Warning: Removed 79 rows containing missing values (geom_bar).
That opened a can of worms. There are obviously some issues with age. I will deal with that in detail in a later notebook. Without considering age in detail, it appears that the maximum accurate age is not more than 120 years.
Assume that the maximum possible voter age is 116 years. The minimum registration age in North Carolina is 16 years (although I have no idea what it was 100 years ago). Therefore, assume that the oldest possible voter could have registered 100 years prior to the snapshot date. That is, regard any registration earlier than 1905-11-25 as very unlikely to be correct.
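Under that assumption, the suspiciously early registrations can be counted directly (a sketch; the 1905-11-25 cutoff mirrors the red reference line in the histogram that follows):

```r
# registrations earlier than the "very unlikely to be correct" cutoff
d %>%
  dplyr::filter(registr_dt < lubridate::ymd("1905-11-25")) %>%
  nrow()
```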
Now look at the distribution of registration dates that are no later than the snapshot date.
d %>%
dplyr::filter(registr_dt <= lubridate::ymd("2005-11-25")) %>%
ggplot() +
geom_vline(xintercept = c(lubridate::ymd("1905-11-25"), lubridate::ymd("1935-11-25")),
colour = "red") +
geom_histogram(aes(x = registr_dt), binwidth = 365.25) + # 1yr bins
scale_y_sqrt()
Look at the relationship between age and registration date. The vast majority of these records will be OK, so may make it easier to spot anomalous regions.
First look at all the records (excluding those with registration date after the snapshot date).
d %>%
dplyr::mutate(age = as.integer(age)) %>%
dplyr::filter(registr_dt <= lubridate::ymd("2005-11-25")) %>%
ggplot() +
geom_hex(aes(x = age, y = registr_dt, fill = stat(log10(count))), binwidth = c(1, 365.25)) # 1yr bins x&y
The heavily populated triangular region contains most of the cases and shows the (mostly) plausible combinations of registration date and age at snapshot date.
Now exclude the manifestly unlikely ages (< 18 or > 104 years).
d %>%
dplyr::mutate(age = as.integer(age)) %>%
dplyr::filter(
dplyr::between(registr_dt, lubridate::ymd("1901-01-01"), lubridate::ymd("2005-11-25")),
dplyr::between(age, 18, 104)
) %>%
ggplot() +
geom_hex(aes(x = age, y = registr_dt, fill = stat(log10(count))), binwidth = c(1, 365.25)) # 1yr bins x&y
The blue’ish upper triangle corresponds to people who were at least 18 years old at registration.
The black fringe below the blue-ish upper triangle corresponds to people who were less than 18 years old at registration.
The negative diagonal line corresponds to people who would have been zero years old at registration.
The points below the negative diagonal line correspond to people who appear to have been registered before they were born.
Most registration dates are consistent with age
A significant fraction of registration dates are inconsistent with age.
There appear to be a nontrivial number of age and registration date combinations that are implausible. These are most likely due to typos in those variables.
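One way to flag the “registered before birth” cases is to approximate each voter’s birth date from their age at the snapshot and compare it with the registration date (a sketch; age is only known in whole years, so the birth date is conservative by up to a year):

```r
snapshot <- lubridate::ymd("2005-11-25")
d %>%
  dplyr::mutate(
    age = as.integer(age),
    # earliest possible birth date consistent with the recorded age
    approx_birth = snapshot - lubridate::years(age + 1)
  ) %>%
  dplyr::filter(registr_dt < approx_birth) %>%  # registered before born
  nrow()
```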
cancellation_dt - Cancellation date
# summary of distribution of cancellation date interpreted as a date
d$cancellation_dt %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
"1994-10-18" "1996-12-30" "1997-01-16" "1996-12-22" "1997-01-27" "2004-10-05"
NA's
"4095558"
# look at the fraction of missing values
table(missing = is.na(d$cancellation_dt))
missing
FALSE TRUE
4141 4095558
table(missing = is.na(d$cancellation_dt)) %>% prop.table() %>% round(3)
missing
FALSE TRUE
0.001 0.999
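The missing fraction can also be computed in a single step (a sketch):

```r
# proportion of records with a missing cancellation date
# (~0.999, matching the table above)
mean(is.na(d$cancellation_dt))
```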
# plot the distribution of nonmissing cancellation date
d %>%
dplyr::filter(!is.na(cancellation_dt)) %>% # not missing
ggplot() +
geom_histogram(aes(x = cancellation_dt), binwidth = 7) + # 1wk bins
scale_y_sqrt()
Almost always missing
Concentrated in 1996 and early 1997 (presumably some sort of administrative purge)
It is not clear what having a cancellation date means for records that are flagged as ACTIVE & VERIFIED. Perhaps they had been removed from the electoral roll in the past and subsequently reinstated.
Given the high proportion of missing values, there is no point keeping cancellation_dt. Write a function to drop it.
# Function to drop cancellation_dt
raw_entity_data_drop_cancel_dt <- function(
d # data frame - raw entity data
) {
d %>%
dplyr::select(-cancellation_dt)
}
Apply the function and check the number of columns before and after dropping.
# number of columns before dropping
d %>%
names() %>% length
[1] 25
d %>%
raw_entity_data_drop_cancel_dt() %>%
# number of columns after dropping
names() %>% length
[1] 24
Computation time (excl. render): 256.585 sec elapsed
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] ggplot2_3.3.3 forcats_0.5.1 lubridate_1.7.10 skimr_2.1.3
[5] stringr_1.4.0 tibble_3.1.0 vroom_1.4.0 fs_1.5.0
[9] tictoc_1.0 here_1.0.1 workflowr_1.6.2 targets_0.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41 ps_1.6.0 assertthat_0.2.1
[5] rprojroot_2.0.2 digest_0.6.27 utf8_1.2.1 R6_2.5.0
[9] repr_1.1.3 evaluate_0.14 highr_0.8 pillar_1.5.1
[13] rlang_0.4.10 data.table_1.14.0 hexbin_1.28.2 callr_3.6.0
[17] jquerylib_0.1.3 rmarkdown_2.7 labeling_0.4.2 igraph_1.2.6
[21] bit_4.0.4 munsell_0.5.0 compiler_4.0.3 httpuv_1.5.5
[25] xfun_0.22 pkgconfig_2.0.3 base64enc_0.1-3 htmltools_0.5.1.1
[29] tidyselect_1.1.0 bookdown_0.21 codetools_0.2-18 fansi_0.4.2
[33] crayon_1.4.1 dplyr_1.0.5 withr_2.4.1 later_1.1.0.1
[37] grid_4.0.3 jsonlite_1.7.2 gtable_0.3.0 lifecycle_1.0.0
[41] DBI_1.1.1 git2r_0.28.0 magrittr_2.0.1 scales_1.1.1
[45] cli_2.3.1 stringi_1.5.3 farver_2.1.0 renv_0.13.1
[49] promises_1.2.0.1 bslib_0.2.4 ellipsis_0.3.1 generics_0.1.0
[53] vctrs_0.3.7 tools_4.0.3 bit64_4.0.5 glue_1.4.2
[57] purrr_0.3.4 processx_3.5.0 parallel_4.0.3 yaml_2.2.1
[61] colorspace_2.0-0 knitr_1.31 sass_0.3.1