Last updated: 2021-03-27

Checks: 6 passed, 1 warning

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20201104)

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 462213b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 462213b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    _targets/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/blk_char.fst
    Ignored:    output/ent_blk.fst
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/local/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  analysis/m_01_1_get_raw_entity_data.Rmd
    Untracked:  analysis/m_01_2_parse_dates.Rmd

Unstaged changes:
    Modified:   R/functions.R
    Modified:   _targets.R
    Modified:   analysis/index.Rmd
    Deleted:    analysis/m_01_get_raw_entity_data.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

# NOTE this notebook can be run manually or automatically by {targets}
# So load the packages required by this notebook here
# rather than relying on _targets.R to load them.

# Set up the project environment, because {workflowr} knits each Rmd file 
# in a new R session, and doesn't execute the project .Rprofile

library(targets) # access data from the targets cache

library(tictoc) # capture execution time
library(here) # construct file paths relative to project root
library(fs) # file system operations
library(vroom) # fast reading of delimited text files
library(tibble) # enhanced data frames

# start the execution time clock
tictoc::tic("Computation time (excl. render)")

# Get the path to the raw entity data file
# This is a target managed by {targets}
f_entity_raw_tsv <- tar_read(c_raw_entity_data_file)

1 Introduction

The aim of this set of meta notebooks is to work out how to read the raw entity data and get it sufficiently neatened so that we can construct standardised names and modelling features without needing any further neatening. To be clear, the target (c_raw_entity_data) corresponding to the objective of this set of notebooks is the neatened raw data, before constructing any modelling features.

This notebook documents the process of working out how to read the raw entity data. This is necessary because the documentation of data is often ambiguous.

The subsequent notebooks in this set will check that all the columns have been read correctly and work out how to fix them, if necessary. The final notebook in this set works out how to save the neatened data (c_raw_entity_data).

1.1 Entity data

This project uses historical voter registration data from the North Carolina State Board of Elections. This information is made publicly available in accordance with North Carolina state law. The Voter Registration Data page links to an online folder of Voter Registration snapshots, which contains the snapshot data files and a data dictionary file describing the layout of the snapshot data files. At the time of writing the snapshot files cover the years 2005 to 2020 with at least one snapshot per year. The files are ZIP compressed and relatively large, with the smallest being 572 MB after compression.

The snapshots contain many columns that are irrelevant to this project (e.g. school district name) and/or prohibited under Australian privacy law (e.g. political affiliation, race). We do not read these unneeded columns from the snapshot file.

We use only one snapshot file (VR_Snapshot_20051125.zip) because this project does not investigate linkage of records across time. We chose the oldest snapshot (2005) because it is the smallest and the contents are the most out of date, minimising the current information made available. Note that this project will not generate any information that is not already directly, publicly available from NCSBE.

2 Display data dictionary

The data dictionary is stored in the data/ directory.

f_entity_raw_dd <- here::here("data", "layout_VR_Snapshot.txt") # data dictionary file

readLines(f_entity_raw_dd) %>% writeLines()

/* *******************************************************************************
* name:    layout_VR_Snapshot.txt
* purpose: Layout for the VR_SNAPSHOT_YYYYMMDD file. This file contains a denormalized
*          point-in-time snapshot of information for active and inactive voters 
*          as-well-as removed voters going back for a period of ten years.
* format:  tab delimited column names in first row
* updated: 06/28/2020
******************************************************************************* */


-- --------------------------------------------------------------------------------
name                            data type       description
-- --------------------------------------------------------------------------------
snapshot_dt         char 10         Date of snapshot
county_id           char  3         County identification number
county_desc         char 15         County description
voter_reg_num           char 12         Voter registration number (unique by county)
ncid                char 12         North Carolina identification number (NCID) of voter
status_cd           char  1         Status code for voter registration
voter_status_desc       char 10         Satus code descriptions.
reason_cd           char  2         Reason code for voter registration status
voter_status_reason_desc    char 60         Reason code description
absent_ind          char  1         <not used> 
name_prefx_cd           char  4         <not used> 
last_name           char 25         Voter last name
first_name          char 20         Voter first name
midl_name           char 20         Voter middle name
name_sufx_cd            char  4         Voter name suffix 
house_num           char 10         Residential address street number
half_code           char  1         Residential address street number half code
street_dir          char  2         Residential address street direction (N,S,E,W,NE,SW, etc.)
street_name         char 30         Residential address street name
street_type_cd          char  4         Residential address street type (RD, ST, DR, BLVD, etc.)
street_sufx_cd          char  4         Residential address street suffix (BUS, EXT, and directional)
unit_designator         char  4         <not used>
unit_num            char  7         Residential address unit number
res_city_desc           char 20         Residential address city name
state_cd            char  2         Residential address state code
zip_code            char  9         Residential address zip code
mail_addr1          char 40         Mailing street address
mail_addr2          char 40         Mailing address line two
mail_addr3          char 40         Mailing address line three
mail_addr4          char 40         Mailing address line four
mail_city           char 30         Mailing address city name
mail_state          char  2         Mailing address state code
mail_zipcode            char  9         Mailing address zip code
area_cd             char  3         Area code for phone number
phone_num           char  7         Telephone number
race_code           char  3         Race code
race_desc           char 35         Race description
ethnic_code         char  2         Ethnicity code
ethnic_desc         char 30         Ethnicity description
party_cd            char  3         Party affiliation code
party_desc          char 12         Party affiliation description
sex_code            char  1         Gender code
sex             char  6         Gender description
age             char  3         Age
birth_place         char  2         Birth place  
registr_dt          char 10         Voter registration date
precinct_abbrv          char  6         Precinct abbreviation
precinct_desc           char 30         Precinct name
municipality_abbrv      char  4         Municipality abbreviation   
municipality_desc       char 30         Municipality name
ward_abbrv          char  4         Ward abbreviation
ward_desc           char 30         Ward name
cong_dist_abbrv         char  4         Congressional district abbreviation 
cong_dist_desc          char 30         Congressional district name
super_court_abbrv       char  4         Supreme Court abbreviation 
super_court_desc        char 30         Supreme Court name
judic_dist_abbrv        char  4         Judicial district abbreviation 
judic_dist_desc         char 30         Judicial district name
NC_senate_abbrv         char  4         NC Senate district abbreviation 
NC_senate_desc          char 30         NC Senate district name
NC_house_abbrv          char  4         NC House district abbreviation 
NC_house_desc           char 30         NC House district name
county_commiss_abbrv        char  4         County Commissioner district abbreviation 
county_commiss_desc     char 30         County Commissioner district name
township_abbrv          char  6         Township district abbreviation
township_desc           char 30         Township district name
school_dist_abbrv       char  6         School district abbreviation
school_dist_desc        char 30         School district name
fire_dist_abbrv         char  4         Fire district abbreviation 
fire_dist_desc          char 30         Fire district name
water_dist_abbrv        char  4         Water district abbreviation 
water_dist_desc         char 30         Water district name
sewer_dist_abbrv        char  4         Sewer district abbreviation 
sewer_dist_desc         char 30         Sewer district name
sanit_dist_abbrv        char  4         Sanitation district abbreviation 
sanit_dist_desc         char 30         Sanitation district name
rescue_dist_abbrv       char  4         Rescue district abbreviation 
rescue_dist_desc        char 30         Rescue district name
munic_dist_abbrv        char  4         Municipal district abbreviation 
munic_dist_desc         char 30         Municipal district name
dist_1_abbrv            char  4         Prosecutorial district abbreviation 
dist_1_desc         char 30         Prosecutorial district name
dist_2_abbrv            char  4         <not used>
dist_2_desc         char 30         <not used>
confidential_ind        char  1         Confidential indicator
cancellation_dt         char 10         Cancellation date
vtd_abbrv           char  6         Voter tabuluation district abbreviation 
vtd_desc            char 30         Voter tabuluation district name 
load_dt             char 10         Data load date
age_group           char 35         Age group range
-- ---------------------------------------------------------------------------------

3 Read entity data

The snapshot ZIP file was manually downloaded (572 MB), uncompressed (5.7 GB), then re-compressed in XZ format to minimise the size (248 MB). The compressed snapshot file and the data dictionary file are stored in the data/ directory.
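R can read an XZ-compressed text file through a connection, so the re-compressed snapshot does not need to be unpacked on disk before it is inspected. The following is a minimal sketch using base R only. The two-row file and the name `tmp_xz` are invented for illustration and stand in for the real snapshot.

```r
# Hypothetical miniature stand-in for the snapshot: header plus two rows
tmp_xz <- tempfile(fileext = ".txt.xz")

con <- xzfile(tmp_xz, open = "wt") # write text through an XZ compressor
writeLines(c("county_id\tlast_name", "18\tAARON", "7\tTHOMPSON"), con)
close(con)

# Read the first lines back without decompressing the file on disk
con <- xzfile(tmp_xz, open = "rt")
head_lines <- readLines(con, n = 2)
close(con)

head_lines
```

Reading only the first few lines this way is a cheap alternative to opening a multi-gigabyte file in a text editor.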

The data file is tab-separated. The data dictionary states this, but it also gives column widths, which could be read as implying fixed-width fields. Examining the data with a text editor confirms that the columns are tab-separated.
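The delimiter can also be checked programmatically rather than by eye. The sketch below counts tab characters per line. In a tab-delimited file every line should have the same count, whereas a fixed-width file would typically have none. The `sample_lines` values are a hypothetical extract, not lines copied from the snapshot.

```r
# Hypothetical sample lines standing in for the top of the snapshot file
sample_lines <- c(
  "snapshot_dt\tcounty_id\tcounty_desc",
  "2005-11-25 00:00:00\t18\tCATAWBA",
  "2005-11-25 00:00:00\t7\tBEAUFORT"
)

# Count tab characters per line
n_tabs <- lengths(
  regmatches(sample_lines, gregexpr("\t", sample_lines, fixed = TRUE))
)

n_tabs
length(unique(n_tabs)) == 1 # TRUE when every line has the same number of tabs
```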

The field widths in the data dictionary (interpreted as maximum lengths) are not accurate. Some fields contain values longer than the stated width.
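One way to find such fields is to compare the longest observed value in each column with the width stated in the data dictionary. This is a sketch on invented data: `dd_width` holds three widths taken from the data dictionary above, while `d_demo` and its over-long `unit_num` value are hypothetical.

```r
# Stated widths from the data dictionary for three columns
dd_width <- c(county_id = 3, last_name = 25, unit_num = 7)

# Hypothetical values; the second unit_num is invented and 9 characters long
d_demo <- data.frame(
  county_id = c("18", "7"),
  last_name = c("AARON", "THOMPSON"),
  unit_num  = c("147 BA", "APT 1234B"),
  stringsAsFactors = FALSE
)

# Longest observed value per column (NA values are ignored)
max_len <- vapply(d_demo, function(x) max(nchar(x), na.rm = TRUE), numeric(1))

# Columns whose observed maximum exceeds the stated width
too_long <- names(max_len)[max_len > dd_width[names(max_len)]]
too_long
```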

Inspection of the raw data with a text editor shows that the character fields are unquoted. However, at least one character value contains a double-quote character, which has the potential to confuse the parsing if it is looking for quoted values.
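The effect of switching quote handling off can be illustrated on a tiny example. The sketch uses base R `read.delim()` as a stand-in for the vroom call in this notebook, and the two-row input is invented. With `quote = ""` the double quote is treated as an ordinary character and the unbalanced quote cannot swallow the following lines.

```r
# Hypothetical two-row extract; the first name value starts with an
# unbalanced double-quote character
txt <- "id\tname\n1\t\"BUD\n2\tANN\n"

# quote = "" turns quote handling off, so the double quote is kept literally
d_quote <- read.delim(
  text = txt,
  quote = "",
  colClasses = "character",
  na.strings = ""
)

d_quote$name
```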

The column specifications are written by taking the column names and their order in the data dictionary as correct.

Read the data file as character columns, to simplify finding wrongly formatted input.
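Once every column is character, wrongly formatted values can be found with a simple pattern match instead of being silently converted to missing values by a typed parser. The following sketch assumes the date layout seen in the snapshot (YYYY-MM-DD HH:MM:SS). The vector is hypothetical and its third value is deliberately malformed.

```r
# Hypothetical values as they might appear after an all-character read
registr_dt <- c("1984-10-06 00:00:00", "2000-07-31 00:00:00", "07/31/2000", NA)

# Expected layout: YYYY-MM-DD HH:MM:SS
ok <- grepl(
  "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}$",
  registr_dt
)

# Values that are present but do not match the expected layout
bad <- registr_dt[!is.na(registr_dt) & !ok]
bad
```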

# Function to get the raw entity data
raw_entity_data_get <- function(
  file_path # character - file path usable by vroom
) {
  vroom::vroom(
    file_path,
    # n_max = 1e4, # limit the rows for testing
    col_select = c( # get all the columns that might conceivably be used
      # the names and ordering are from the metadata file
      snapshot_dt : voter_status_reason_desc, # 9 cols
      last_name : street_sufx_cd, # 10 cols
      unit_num : zip_code, # 4 cols
      area_cd, phone_num, # 2 cols
      sex_code : registr_dt, # 5 cols
      cancellation_dt, load_dt # 2 cols
    ), # total 32 cols
    col_types = vroom::cols(
      .default = vroom::col_character() # all cols as chars to allow for bad formatting
    ),
    delim = "\t", # assume that fields are *only* delimited by tabs
    col_names = TRUE, # use the column names on the first line of data
    na = "", # missing fields are empty string or whitespace only (see trim_ws argument)
    quote = "", # don't allow for quoted strings
    comment = "", # don't allow for comments
    trim_ws = TRUE, # trim leading and trailing whitespace
    escape_double = FALSE, # assume no escaped quotes
    escape_backslash = FALSE # assume no escaped backslashes
  )
}

# Show the data file name
fs::path_file(f_entity_raw_tsv)

[1] "VR_20051125.txt.xz"

d <- raw_entity_data_get(f_entity_raw_tsv)

Check the number of rows and columns read and take a quick look at the data.

dplyr::glimpse(d)

Rows: 8,003,293
Columns: 32
$ snapshot_dt              <chr> "2005-11-25 00:00:00", "2005-11-25 00:00:00",…
$ county_id                <chr> "18", "7", "10", "16", "58", "60", "62", "73"…
$ county_desc              <chr> "CATAWBA", "BEAUFORT", "BRUNSWICK", "CARTERET…
$ voter_reg_num            <chr> "0", "000000000000", "000000000000", "0000000…
$ ncid                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ status_cd                <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", …
$ voter_status_desc        <chr> "REMOVED", "REMOVED", "REMOVED", "REMOVED", "…
$ reason_cd                <chr> "RL", "R2", "R2", "RP", "R2", "RL", "RP", "RP…
$ voter_status_reason_desc <chr> "MOVED FROM COUNTY", "DUPLICATE", "DUPLICATE"…
$ last_name                <chr> "AARON", "THOMPSON", "WILSON", "LANGSTON", "B…
$ first_name               <chr> "CHARLES", "JESSICA", "WILLIAM", "VON", "LIZZ…
$ midl_name                <chr> "F", "RUTH", "B", NA, "IRENE", "R", "HUGHES",…
$ name_sufx_cd             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ house_num                <chr> "0", "961", "0", "264", "1536", "1431", "171"…
$ half_code                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ street_dir               <chr> NA, NA, NA, NA, NA, "E", NA, NA, NA, NA, NA, …
$ street_name              <chr> "ROUTE 4", "TAYLOR", "MIRROR LAKE", "CARL GAR…
$ street_type_cd           <chr> NA, "RD", NA, "RD", "RD", "ST", NA, NA, NA, "…
$ street_sufx_cd           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ unit_num                 <chr> "147 BA", NA, NA, NA, NA, "1", NA, NA, NA, NA…
$ res_city_desc            <chr> "CONOVER", "CHOCOWINITY", "BOILING SPRING LAK…
$ state_cd                 <chr> "NC", "NC", "NC", "NC", "NC", "NC", "NC", NA,…
$ zip_code                 <chr> "28613", "27817", "28461", "28570", "27892", …
$ area_cd                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ phone_num                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sex_code                 <chr> "M", "F", "U", "M", "F", "F", "M", "U", "U", …
$ sex                      <chr> "MALE", "FEMALE", "UNK", "MALE", "FEMALE", "F…
$ age                      <chr> "62", "26", "0", "58", "63", "30", "93", "0",…
$ birth_place              <chr> NA, "NC", NA, "MI", NA, "VA", "NC", NA, NA, "…
$ registr_dt               <chr> "1984-10-06 00:00:00", "2000-07-31 00:00:00",…
$ cancellation_dt          <chr> NA, "2001-07-06 00:00:00", "2001-02-05 00:00:…
$ load_dt                  <chr> "2014-07-15 22:21:54.150000000", "2014-07-15 …

- Correct number of data rows read
  - External line count of input file = 8,003,294 (including header row of column names)
- Correct number of columns read (checked against manual count of columns in data dictionary)
- The initial values in each column seem plausible with respect to the column description
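The row-count check can be automated. The notebook does not state how the external line count was obtained, so the sketch below is only one possible approach: a hypothetical helper, `count_lines()`, that reads the file in chunks, applied to a small invented file.

```r
# Hypothetical helper: count lines without holding the whole file in memory
count_lines <- function(path, chunk = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  repeat {
    k <- length(readLines(con, n = chunk))
    if (k == 0) break
    n <- n + k
  }
  n
}

# Hypothetical miniature file: header plus three data rows
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname", "1\tA", "2\tB", "3\tC"), tmp)
d_small <- read.delim(tmp, quote = "", colClasses = "character")

# Data rows should equal the external line count minus the header row
stopifnot(nrow(d_small) == count_lines(tmp) - 1)
```

Applied to the snapshot, the same comparison corresponds to 8,003,293 data rows against 8,003,294 lines including the header.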

Timing

Computation time (excl. render): 29.386 sec elapsed

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] tibble_3.1.0    vroom_1.4.0     fs_1.5.0        tictoc_1.0     
[5] here_1.0.1      workflowr_1.6.2 targets_0.2.0  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.22         bslib_0.2.4       purrr_0.3.4      
 [5] vctrs_0.3.6       generics_0.1.0    htmltools_0.5.1.1 yaml_2.2.1       
 [9] utf8_1.2.1        rlang_0.4.10      later_1.1.0.1     pillar_1.5.1     
[13] jquerylib_0.1.3   DBI_1.1.1         glue_1.4.2        withr_2.4.1      
[17] bit64_4.0.5       lifecycle_1.0.0   stringr_1.4.0     codetools_0.2-18 
[21] evaluate_0.14     knitr_1.31        callr_3.5.1       httpuv_1.5.5     
[25] ps_1.6.0          parallel_4.0.3    fansi_0.4.2       Rcpp_1.0.6       
[29] renv_0.13.1       promises_1.2.0.1  jsonlite_1.7.2    bit_4.0.4        
[33] digest_0.6.27     stringi_1.5.3     bookdown_0.21     processx_3.4.5   
[37] dplyr_1.0.5       rprojroot_2.0.2   cli_2.3.1         tools_4.0.3      
[41] magrittr_2.0.1    sass_0.3.1        crayon_1.4.1      pkgconfig_2.0.3  
[45] ellipsis_0.3.1    data.table_1.14.0 assertthat_0.2.1  rmarkdown_2.7    
[49] R6_2.5.0          igraph_1.2.6      compiler_4.0.3    git2r_0.28.0

[meta] Read the raw entity data

m_01_1_get_raw_entity_data

Ross Gayler

2021-03-06
