Last updated: 2021-03-27
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 462213b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: _targets/
Ignored: data/VR_20051125.txt.xz
Ignored: output/blk_char.fst
Ignored: output/ent_blk.fst
Ignored: output/ent_cln.fst
Ignored: output/ent_raw.fst
Ignored: renv/library/
Ignored: renv/local/
Ignored: renv/staging/
Untracked files:
Untracked: analysis/m_01_1_get_raw_entity_data.Rmd
Untracked: analysis/m_01_2_parse_dates.Rmd
Unstaged changes:
Modified: R/functions.R
Modified: _targets.R
Modified: analysis/index.Rmd
Deleted: analysis/m_01_get_raw_entity_data.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.
# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.
# Set up the project environment, because {workflowr} knits each Rmd file
# in a new R session, and doesn't execute the project .Rprofile
library(targets) # access data from the targets cache
library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
library(fs) # file system operations
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames
# start the execution time clock
tictoc::tic("Computation time (excl. render)")
# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)
The aim of this set of meta notebooks is to work out how to read the raw entity data and get it sufficiently neatened so that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.
This notebook documents the process of working out how to read the raw entity data. This is necessary because the documentation of data is often ambiguous.
The subsequent notebooks in this set will check that all the columns have been read correctly and work out how to fix them, if necessary. The final notebook in this set works out how to save the neatened data (c_raw_entity_data).
This project uses historical voter registration data from the North Carolina State Board of Elections. This information is made publicly available in accordance with North Carolina state law. The Voter Registration Data page links to an online folder of Voter Registration snapshots, which contains the snapshot data files and a data dictionary file describing the layout of the snapshot data files. At the time of writing the snapshot files cover the years 2005 to 2020 with at least one snapshot per year. The files are ZIP compressed and relatively large, with the smallest being 572 MB after compression.
The snapshots contain many columns that are irrelevant to this project (e.g. school district name) and/or prohibited under Australian privacy law (e.g. political affiliation, race). We do not read these unneeded columns from the snapshot file.
We use only one snapshot file (VR_Snapshot_20051125.zip) because this project does not investigate linkage of records across time. We chose the oldest snapshot (2005) because it is the smallest and the contents are the most out of date, minimising the current information made available. Note that this project will not generate any information that is not already directly, publicly available from NCSBE.
The data dictionary is stored in the data/ directory.
f_entity_raw_dd <- here::here("data", "layout_VR_Snapshot.txt") # data dictionary file
readLines(f_entity_raw_dd) %>% writeLines()
/* *******************************************************************************
* name: layout_VR_Snapshot.txt
* purpose: Layout for the VR_SNAPSHOT_YYYYMMDD file. This file contains a denormalized
* point-in-time snapshot of information for active and inactive voters
* as-well-as removed voters going back for a period of ten years.
* format: tab delimited column names in first row
* updated: 06/28/2020
******************************************************************************* */
-- --------------------------------------------------------------------------------
name data type description
-- --------------------------------------------------------------------------------
snapshot_dt char 10 Date of snapshot
county_id char 3 County identification number
county_desc char 15 County description
voter_reg_num char 12 Voter registration number (unique by county)
ncid char 12 North Carolina identification number (NCID) of voter
status_cd char 1 Status code for voter registration
voter_status_desc char 10 Satus code descriptions.
reason_cd char 2 Reason code for voter registration status
voter_status_reason_desc char 60 Reason code description
absent_ind char 1 <not used>
name_prefx_cd char 4 <not used>
last_name char 25 Voter last name
first_name char 20 Voter first name
midl_name char 20 Voter middle name
name_sufx_cd char 4 Voter name suffix
house_num char 10 Residential address street number
half_code char 1 Residential address street number half code
street_dir char 2 Residential address street direction (N,S,E,W,NE,SW, etc.)
street_name char 30 Residential address street name
street_type_cd char 4 Residential address street type (RD, ST, DR, BLVD, etc.)
street_sufx_cd char 4 Residential address street suffix (BUS, EXT, and directional)
unit_designator char 4 <not used>
unit_num char 7 Residential address unit number
res_city_desc char 20 Residential address city name
state_cd char 2 Residential address state code
zip_code char 9 Residential address zip code
mail_addr1 char 40 Mailing street address
mail_addr2 char 40 Mailing address line two
mail_addr3 char 40 Mailing address line three
mail_addr4 char 40 Mailing address line four
mail_city char 30 Mailing address city name
mail_state char 2 Mailing address state code
mail_zipcode char 9 Mailing address zip code
area_cd char 3 Area code for phone number
phone_num char 7 Telephone number
race_code char 3 Race code
race_desc char 35 Race description
ethnic_code char 2 Ethnicity code
ethnic_desc char 30 Ethnicity description
party_cd char 3 Party affiliation code
party_desc char 12 Party affiliation description
sex_code char 1 Gender code
sex char 6 Gender description
age char 3 Age
birth_place char 2 Birth place
registr_dt char 10 Voter registration date
precinct_abbrv char 6 Precinct abbreviation
precinct_desc char 30 Precinct name
municipality_abbrv char 4 Municipality abbreviation
municipality_desc char 30 Municipality name
ward_abbrv char 4 Ward abbreviation
ward_desc char 30 Ward name
cong_dist_abbrv char 4 Congressional district abbreviation
cong_dist_desc char 30 Congressional district name
super_court_abbrv char 4 Supreme Court abbreviation
super_court_desc char 30 Supreme Court name
judic_dist_abbrv char 4 Judicial district abbreviation
judic_dist_desc char 30 Judicial district name
NC_senate_abbrv char 4 NC Senate district abbreviation
NC_senate_desc char 30 NC Senate district name
NC_house_abbrv char 4 NC House district abbreviation
NC_house_desc char 30 NC House district name
county_commiss_abbrv char 4 County Commissioner district abbreviation
county_commiss_desc char 30 County Commissioner district name
township_abbrv char 6 Township district abbreviation
township_desc char 30 Township district name
school_dist_abbrv char 6 School district abbreviation
school_dist_desc char 30 School district name
fire_dist_abbrv char 4 Fire district abbreviation
fire_dist_desc char 30 Fire district name
water_dist_abbrv char 4 Water district abbreviation
water_dist_desc char 30 Water district name
sewer_dist_abbrv char 4 Sewer district abbreviation
sewer_dist_desc char 30 Sewer district name
sanit_dist_abbrv char 4 Sanitation district abbreviation
sanit_dist_desc char 30 Sanitation district name
rescue_dist_abbrv char 4 Rescue district abbreviation
rescue_dist_desc char 30 Rescue district name
munic_dist_abbrv char 4 Municipal district abbreviation
munic_dist_desc char 30 Municipal district name
dist_1_abbrv char 4 Prosecutorial district abbreviation
dist_1_desc char 30 Prosecutorial district name
dist_2_abbrv char 4 <not used>
dist_2_desc char 30 <not used>
confidential_ind char 1 Confidential indicator
cancellation_dt char 10 Cancellation date
vtd_abbrv char 6 Voter tabuluation district abbreviation
vtd_desc char 30 Voter tabuluation district name
load_dt char 10 Data load date
age_group char 35 Age group range
-- ---------------------------------------------------------------------------------
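The layout lines above follow a regular "name, type, width, description" pattern, so the dictionary can be turned into a table for later checks. The following is a minimal base R sketch, not part of the original analysis. The helper name dd_parse is hypothetical, and the pattern assumes that each layout line has a single-token name and type followed by an integer width.

```r
# Hypothetical helper (not from the original notebook):
# parse layout lines of the form "<name> <type> <width> <description>"
dd_parse <- function(lines) {
  pattern <- "^(\\S+)\\s+(\\S+)\\s+(\\d+)\\s+(.*)$"
  # keep only lines that look like column definitions
  keep <- grepl(pattern, lines, perl = TRUE)
  m <- regmatches(lines[keep], regexec(pattern, lines[keep], perl = TRUE))
  data.frame(
    name = vapply(m, `[`, character(1), 2),
    type = vapply(m, `[`, character(1), 3),
    width = as.integer(vapply(m, `[`, character(1), 4)),
    description = vapply(m, `[`, character(1), 5),
    stringsAsFactors = FALSE
  )
}

# A few lines copied from the dictionary listing, plus header/ruler lines
dd_lines <- c(
  "name data type description",
  "-- ----",
  "snapshot_dt char 10 Date of snapshot",
  "county_id char 3 County identification number",
  "absent_ind char 1 <not used>"
)
dd <- dd_parse(dd_lines)
print(dd)
```

In the notebook the input would be readLines(f_entity_raw_dd) rather than the short vector used here. The header and ruler lines are skipped because they do not contain an integer width in the third position.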
The snapshot ZIP file was manually downloaded (572 MB), uncompressed (5.7 GB), then re-compressed in XZ format to minimise the size (248 MB). The compressed snapshot file and the data dictionary file are stored in the data/ directory.
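The re-compression was done manually and the tool used is not recorded here. As an illustration only, the sketch below shows how text can be written to and read back from an XZ-compressed file with base R connections. The file path and the two toy lines are assumptions standing in for the real snapshot.

```r
# Toy stand-in for the uncompressed snapshot text (an assumption)
txt_lines <- c("snapshot_dt\tcounty_id", "2005-11-25 00:00:00\t18")

# Hypothetical output path in a temporary directory
f_xz <- tempfile(fileext = ".txt.xz")

# Write the lines through an XZ-compressing connection
con <- xzfile(f_xz, open = "w", compression = 9)
writeLines(txt_lines, con)
close(con)

# Read the lines back through a decompressing connection
con <- xzfile(f_xz, open = "r")
txt_back <- readLines(con)
close(con)

# The round trip should return the original lines unchanged
print(identical(txt_back, txt_lines))
```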
The data is tab-separated. The data dictionary says that the data file is tab separated, but it also gives column widths, which could be interpreted as implying that the data is formatted as fixed-width fields. Examining the data with a text editor shows that the columns are tab separated.
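The same check can be made programmatically instead of with a text editor. The sketch below is illustrative only: the three lines are a toy stand-in for the first lines of the snapshot file. It counts the tab characters on each line and compares the line lengths. A constant field count with varying line lengths is consistent with tab-delimited rather than fixed-width data.

```r
# Toy stand-in for the first few lines of the snapshot file (an assumption)
sample_lines <- c(
  "snapshot_dt\tcounty_id\tcounty_desc",
  "2005-11-25 00:00:00\t18\tCATAWBA",
  "2005-11-25 00:00:00\t7\tBEAUFORT"
)

# Number of tab characters on each line
n_tabs <- lengths(regmatches(sample_lines, gregexpr("\t", sample_lines, fixed = TRUE)))

# Number of fields is one more than the number of tabs
n_fields <- n_tabs + 1L

# Line lengths in characters; fixed-width records would all be the same length
line_len <- nchar(sample_lines)

print(n_fields)
print(line_len)
```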
The field widths in the data dictionary (interpreted as maximum lengths) are not accurate. Some fields contain values longer than the stated width.
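One way to find such fields is to compare the longest observed value in each column with the width stated in the dictionary. This is a minimal sketch under stated assumptions: the data frame d_toy and its values are invented for illustration, and only the two widths shown are taken from the dictionary listing.

```r
# Widths for two columns, as stated in the data dictionary
dd_width <- c(county_desc = 15L, unit_num = 7L)

# Invented toy data (an assumption, not values from the snapshot)
d_toy <- data.frame(
  county_desc = c("CATAWBA", "BEAUFORT"),
  unit_num = c("147 BA", "APARTMENT 12"),
  stringsAsFactors = FALSE
)

# Longest observed value in each column, ignoring missing values
max_len <- vapply(
  d_toy[names(dd_width)],
  function(x) max(nchar(x), na.rm = TRUE),
  integer(1)
)

# Columns whose longest value exceeds the documented width
over_width <- names(dd_width)[max_len > dd_width]
print(over_width)
```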
Inspection of the raw data with a text editor shows that the character fields are unquoted. However, at least one character value contains a double-quote character, which has the potential to confuse the parsing if it is looking for quoted values.
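The sketch below illustrates why quote handling is switched off when reading this kind of data. It uses base R's read.delim on invented toy text rather than vroom on the snapshot file, so it is a stand-in for the idea and not the code used in this project. With quote = "" the double-quote character is treated as ordinary data, so the value is kept verbatim and the row structure is preserved.

```r
# Invented toy text with an unmatched double quote inside a field (an assumption)
txt <- "first_name\tlast_name\nJOHN \"JACK\tSMITH\nMARY\tJONES\n"

# Read with quoting disabled so that '"' is treated as an ordinary character
d_q <- read.delim(
  text = txt,
  quote = "",                # don't allow for quoted strings
  colClasses = "character",  # read every column as character
  na.strings = "",
  stringsAsFactors = FALSE
)

print(d_q)
```

If quoting were enabled, a parser could instead treat the double quote as the start of a quoted string and keep reading across the following tab or newline while looking for the closing quote.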
The column specifications are written by taking the column names and their order in the data dictionary as correct.
Read the data file as character columns, to simplify finding wrongly formatted input.
# Function to get the raw entity data
raw_entity_data_get <- function(
  file_path # character - file path usable by vroom
) {
  vroom::vroom(
    file_path,
    # n_max = 1e4, # limit the rows for testing
    col_select = c( # get all the columns that might conceivably be used
      # the names and ordering are from the metadata file
      snapshot_dt : voter_status_reason_desc, # 9 cols
      last_name : street_sufx_cd, # 10 cols
      unit_num : zip_code, # 4 cols
      area_cd, phone_num, # 2 cols
      sex_code : registr_dt, # 5 cols
      cancellation_dt, load_dt # 2 cols
    ), # total 32 cols
    col_types = cols(
      .default = col_character() # all cols as chars to allow for bad formatting
    ),
    delim = "\t", # assume that fields are *only* delimited by tabs
    col_names = TRUE, # use the column names on the first line of data
    na = "", # missing fields are empty string or whitespace only (see trim_ws argument)
    quote = "", # don't allow for quoted strings
    comment = "", # don't allow for comments
    trim_ws = TRUE, # trim leading and trailing whitespace
    escape_double = FALSE, # assume no escaped quotes
    escape_backslash = FALSE # assume no escaped backslashes
  )
}
# Show the data file name
fs::path_file(f_entity_raw_tsv)
[1] "VR_20051125.txt.xz"
d <- raw_entity_data_get(f_entity_raw_tsv)
Check the number of rows and columns read and take a quick look at the data.
dplyr::glimpse(d)
Rows: 8,003,293
Columns: 32
$ snapshot_dt <chr> "2005-11-25 00:00:00", "2005-11-25 00:00:00",…
$ county_id <chr> "18", "7", "10", "16", "58", "60", "62", "73"…
$ county_desc <chr> "CATAWBA", "BEAUFORT", "BRUNSWICK", "CARTERET…
$ voter_reg_num <chr> "0", "000000000000", "000000000000", "0000000…
$ ncid <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ status_cd <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", …
$ voter_status_desc <chr> "REMOVED", "REMOVED", "REMOVED", "REMOVED", "…
$ reason_cd <chr> "RL", "R2", "R2", "RP", "R2", "RL", "RP", "RP…
$ voter_status_reason_desc <chr> "MOVED FROM COUNTY", "DUPLICATE", "DUPLICATE"…
$ last_name <chr> "AARON", "THOMPSON", "WILSON", "LANGSTON", "B…
$ first_name <chr> "CHARLES", "JESSICA", "WILLIAM", "VON", "LIZZ…
$ midl_name <chr> "F", "RUTH", "B", NA, "IRENE", "R", "HUGHES",…
$ name_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ house_num <chr> "0", "961", "0", "264", "1536", "1431", "171"…
$ half_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ street_dir <chr> NA, NA, NA, NA, NA, "E", NA, NA, NA, NA, NA, …
$ street_name <chr> "ROUTE 4", "TAYLOR", "MIRROR LAKE", "CARL GAR…
$ street_type_cd <chr> NA, "RD", NA, "RD", "RD", "ST", NA, NA, NA, "…
$ street_sufx_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ unit_num <chr> "147 BA", NA, NA, NA, NA, "1", NA, NA, NA, NA…
$ res_city_desc <chr> "CONOVER", "CHOCOWINITY", "BOILING SPRING LAK…
$ state_cd <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", NA,…
$ zip_code <chr> "28613", "27817", "28461", "28570", "27892", …
$ area_cd <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ phone_num <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sex_code <chr> "M", "F", "U", "M", "F", "F", "M", "U", "U", …
$ sex <chr> "MALE", "FEMALE", "UNK", "MALE", "FEMALE", "F…
$ age <chr> "62", "26", "0", "58", "63", "30", "93", "0",…
$ birth_place <chr> NA, "NC", NA, "MI", NA, "VA", "NC", NA, NA, "…
$ registr_dt <chr> "1984-10-06 00:00:00", "2000-07-31 00:00:00",…
$ cancellation_dt <chr> NA, "2001-07-06 00:00:00", "2001-02-05 00:00:…
$ load_dt <chr> "2014-07-15 22:21:54.150000000", "2014-07-15 …
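Because every column was read as character, wrongly formatted values can be located with a pattern check before any type conversion is attempted. The following is a minimal sketch of that idea. The vector x is invented toy data, and the pattern assumes the "YYYY-MM-DD HH:MM:SS" layout seen in the snapshot_dt values above.

```r
# Invented toy values standing in for a date column read as character (an assumption)
x <- c(
  "2005-11-25 00:00:00",
  "2005-11-25",
  NA,
  "11/25/2005 00:00:00"
)

# Expected layout: YYYY-MM-DD HH:MM:SS
pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}$"

# Flag non-missing values that do not match the expected layout
bad <- !is.na(x) & !grepl(pattern, x)

# Positions and values of the wrongly formatted entries
print(which(bad))
print(x[bad])
```

Missing values are excluded from the flag so that they can be counted and handled separately from malformed values.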
Computation time (excl. render): 29.386 sec elapsed
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] tibble_3.1.0 vroom_1.4.0 fs_1.5.0 tictoc_1.0
[5] here_1.0.1 workflowr_1.6.2 targets_0.2.0
loaded via a namespace (and not attached):
[1] tidyselect_1.1.0 xfun_0.22 bslib_0.2.4 purrr_0.3.4
[5] vctrs_0.3.6 generics_0.1.0 htmltools_0.5.1.1 yaml_2.2.1
[9] utf8_1.2.1 rlang_0.4.10 later_1.1.0.1 pillar_1.5.1
[13] jquerylib_0.1.3 DBI_1.1.1 glue_1.4.2 withr_2.4.1
[17] bit64_4.0.5 lifecycle_1.0.0 stringr_1.4.0 codetools_0.2-18
[21] evaluate_0.14 knitr_1.31 callr_3.5.1 httpuv_1.5.5
[25] ps_1.6.0 parallel_4.0.3 fansi_0.4.2 Rcpp_1.0.6
[29] renv_0.13.1 promises_1.2.0.1 jsonlite_1.7.2 bit_4.0.4
[33] digest_0.6.27 stringi_1.5.3 bookdown_0.21 processx_3.4.5
[37] dplyr_1.0.5 rprojroot_2.0.2 cli_2.3.1 tools_4.0.3
[41] magrittr_2.0.1 sass_0.3.1 crayon_1.4.1 pkgconfig_2.0.3
[45] ellipsis_0.3.1 data.table_1.14.0 assertthat_0.2.1 rmarkdown_2.7
[49] R6_2.5.0 igraph_1.2.6 compiler_4.0.3 git2r_0.28.0