Last updated: 2024-09-16

Checks: 6 1

Knit directory: PODFRIDGE/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20230302)

The command set.seed(20230302) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 2479f41

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 2479f41. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  data/final_CODIS_data.csv

Unstaged changes:
    Deleted:    analysis/CODIS_DB_composition.Rmd
    Deleted:    analysis/IBD.Rmd
    Deleted:    analysis/STR.Rmd
    Deleted:    analysis/analyses.Rmd
    Deleted:    analysis/background.Rmd
    Deleted:    analysis/demo-data.Rmd
    Modified:   analysis/index.Rmd
    Deleted:    analysis/manuscript_figures.Rmd
    Modified:   analysis/regression.Rmd
    Deleted:    analysis/representation-risk.Rmd
    Deleted:    slurm-11226865.out
    Deleted:    slurm-11226959.out
    Deleted:    slurm-11227723.out
    Deleted:    slurm-11228488.out
    Deleted:    slurm-11228791.out
    Deleted:    slurm-11229998.out
    Deleted:    slurm-11230240.out

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/regression.Rmd) and HTML (docs/regression.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	5d7baf2	hcvw	2024-09-06	Build site.
html	a492d9c	hcvw	2024-09-06	Build site.
html	a60d243	hcvw	2024-09-05	Build site.
Rmd	457d560	hcvw	2024-09-05	wflow_publish(c("analysis/regression.Rmd"))
html	0cc6e3a	hcvw	2024-08-14	Build site.
Rmd	bcf9628	hcvw	2024-08-14	wflow_publish(c("analysis/regression.Rmd"))
html	45906fc	GitHub	2024-08-14	Add files via upload
Rmd	9728f48	GitHub	2024-08-14	Update regression.Rmd
html	e6972d6	hcvw	2024-06-25	Build site.
html	1eb6d2c	hcvw	2024-06-25	Build site.
Rmd	96b197a	hcvw	2024-06-25	wflow_publish(c("analysis/regression.Rmd"))

To estimate the number of people in CODIS by state and race, we need information on (1) the number of people in CODIS in each state, and (2) the racial composition of this number. We have several data sources, each of which provide different information for different states:

Direct data on the number of Black and White people in CODIS, from a Freedom of Information Act (FOIA) request conducted by Muprhy and Tong (2020) for the following states: California, Florida, Indiana, Maine, Nevada, South Dakota, and Texas. [1]
The number of people of each racial group in prison in each state. This data was pulled from a variety of sources and is available in Klein et al. (2023). [2]
The number of people in the State DNA Indexing System (SDIS) and the National DNA Indexing System (NDIS). NDIS data is publicly available for all 50 states [3], whereas SDIS data is only available on a state-by-state basis and is obtained from internet searches. Within each DNA indexing system, both the number of offenders and the number of arrestees are recorded.

To leverage this data to make estimates on the number of people in CODIS by state, we separate states into three categories:
1. States who have data available in the Murphy & Tong dataset. For these states, no calculations are needed to estimate the number of people in CODIS.
2. States who have NDIS and SDIS data available but are not in the Murphy & Tong dataset.
3. States with only NDIS data available.

The plot below shows the data that is available for each state:

For states in categories (2) and (3), we need to generate an estimation of the racial composition of the SDIS profiles. To generate this estimation, we use the Murphy & Tong FOIA data to create a regression model of proportion of Black and White people in the data set with the following independent variables: the U.S. census proportion of each state for each rate, the percent of the state’s prison population that is each race, an indicator variable for Black/White race, and interaction variables for census proportion by race and prison racial population by race. We use the coefficients from the regression model to make predictions for the remaining states that do not have data available on the racial composition of the DNA databases.

\[Proportion_{race} = \beta_0 + \beta_1census_{proportion} + \beta_2prison_{proportion} + \beta_3race + \beta_4race*census_{proportion} + \beta_5race*prison_{proportion}\]

For states in category (3) we need both an estimation of the number of people in SDIS, and an estimation of the number of the racial composition of the datasets. To estimate the racial composition of the database, we use the regression model described above. To generate predictions of the number of people in SDIS for these states, we create an additional regression model with dependent variable the number of people in SDIS and independent variables for the U.S. census proportion of the population that is each race, the proportion of the state’s prison population that is each race, and the number of people in NDIS for that state. We create separate regression models for arrests and offenders to obtain more accurate predictions:

\[N_{arrestee} = \beta_0 + \beta_1census_{black} + \beta_2census_{white} + \beta_3prison_{black} + \beta_4prison_{white} + \beta_5NDIS_{arrestees} \] and

\[N_{offender} = \beta_0 + \beta_1census_{black} + \beta_2census_{white} + \beta_3prison_{black} + \beta_4prison_{white} + \beta_5NDIS_{offenders} \]

Part 1: regression to estimate the racial composition of each database.

The following plot shows the coefficient estimates for the regression model that estimates the racial composition of the CODIS dataset using the Murphy & Tong states.

While none of the coefficients were significant in the regression, our model had an \(R^2\) value of 0.93, demonstrating a good fit. The following plot showing the estimated racial composition using the regression model vs. the true values for the states with available data. We also plot a difference plot (also known as a Bland-Altman plot).

Version	Author	Date
a60d243	hcvw	2024-09-05
1eb6d2c	hcvw	2024-06-25

Version	Author	Date
1eb6d2c	hcvw	2024-06-25

Part 2: SDIS regression

The following plots show the results of the SDIS regression for both arrestees (left) and offenders (right):

Part 3: final data

Using the above regression models, we make predictions of the number of Black and White people in CODIS by state. The plot below shows our estimates for each state, colored by the data source used for each state. The number of Black people in the database are indicated with circles and the number of White people is indicated by triangles.

We additionally generate a plot showing the difference in the percent of people in CODIS that are Black versus the U.S. census percent of the population that is Black in each state.

Version	Author	Date
a492d9c	hcvw	2024-09-06
a60d243	hcvw	2024-09-05

The table containing the estimates for the number of people of each race in CODIS by state, along with the source of the data is below:

            State Black Profiles White Profiles          Source
1         Alabama    181118.5149     217600.769 SDIS+regression
2          Alaska      5549.5882      39296.292 Regression only
3         Arizona     58944.9246     257202.483 Regression only
4        Arkansas      4031.2855       8680.932 SDIS+regression
5      California    473373.9990     819407.624          Murphy
6        Colorado     57866.4208     262106.908 SDIS+regression
7     Connecticut      5751.2303       7802.577 SDIS+regression
8        Delaware      4539.1416       5410.328 Regression only
9         Florida    475434.7840     829309.538          Murphy
10        Georgia    206442.0982     181496.282 SDIS+regression
11         Hawaii      1242.7095      11370.778 Regression only
12          Idaho       963.4824      34101.874 SDIS+regression
13       Illinois    253303.7558     305785.832 SDIS+regression
14        Indiana     80005.6400     215399.800          Murphy
15           Iowa     21824.2069     119945.100 Regression only
16         Kansas     44190.1327     210909.471 Regression only
17       Kentucky     41798.7993     217311.612 Regression only
18      Louisiana    432013.8926     338844.041 SDIS+regression
19          Maine      1281.0330      30482.016          Murphy
20       Maryland     78263.2834      46862.137 Regression only
21  Massachusetts     33450.8754      94768.533 Regression only
22       Michigan    230192.9750     376683.192 Regression only
23      Minnesota     37914.2174     115906.979 SDIS+regression
24    Mississippi     77616.9619      62226.769 SDIS+regression
25       Missouri    103504.8306     309407.963 SDIS+regression
26        Montana       641.1110      28312.414 SDIS+regression
27       Nebraska      8699.6015      34027.310 Regression only
28         Nevada     42937.8560     116401.844          Murphy
29  New Hampshire       688.2462      14974.579 Regression only
30     New Jersey     12389.6261      11436.627 SDIS+regression
31     New Mexico      8742.4332     138405.866 Regression only
32       New York    249297.7042     253243.373 Regression only
33 North Carolina    151156.0304     193565.650 SDIS+regression
34   North Dakota      3481.9052      28239.746 Regression only
35           Ohio    313483.4690     653979.072 Regression only
36       Oklahoma     44903.8916     149971.344 Regression only
37         Oregon     13690.7917     202325.147 Regression only
38   Pennsylvania    156674.3833     288601.601 Regression only
39   Rhode Island      5961.4416       8911.275 SDIS+regression
40 South Carolina     91029.7856      91493.358 SDIS+regression
41   South Dakota      4056.0000      45156.800          Murphy
42      Tennessee    156539.3105     321673.841 Regression only
43          Texas    279646.6350     358447.405          Murphy
44           Utah      6663.1740     106665.663 Regression only
45        Vermont       680.5597      12917.961 Regression only
46       Virginia    218711.3900     275687.781 SDIS+regression
47     Washington     31472.5629     235805.829 SDIS+regression
48  West Virginia      4234.0180      47855.252 SDIS+regression
49      Wisconsin     95752.7058     279276.555 Regression only
50        Wyoming       668.1218      18897.896 Regression only

Finally, we generate side-by-side pie charts for each state showing the racial composition according to the census (left) versus the estimated racial composition of CODIS (right) for each state. Note that groups not identifying as Black or White are omitted for easy comparison.

References

[1] Murphy, Erin, and Jun H. Tong. “The racial composition of forensic DNA databases.” Calif. L. Rev. 108 (2020): 1847.

[2] Klein, Brennan, et al. “COVID-19 amplified racial disparities in the US criminal legal system.” Nature 617.7960 (2023): 344-350.

[3] https://le.fbi.gov/science-and-lab/biometrics-and-fingerprints/codis/codis-ndis-statistics

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Detroit
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sf_1.0-17         viridis_0.6.5     viridisLite_0.4.2 cowplot_1.1.3    
 [5] tidycensus_1.6.5  sandwich_3.1-1    ggpubr_0.6.0      jtools_2.3.0     
 [9] knitr_1.48        lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1    
[13] dplyr_1.1.4       purrr_1.0.2       tidyr_1.3.1       tibble_3.2.1     
[17] ggplot2_3.5.1     tidyverse_2.0.0   readr_2.1.5       workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    farver_2.1.2        fastmap_1.2.0      
 [4] promises_1.3.0      digest_0.6.36       timechange_0.3.0   
 [7] lifecycle_1.0.4     processx_3.8.4      magrittr_2.0.3     
[10] compiler_4.4.1      rlang_1.1.4         sass_0.4.9         
[13] tools_4.4.1         utf8_1.2.4          yaml_2.3.9         
[16] ggsignif_0.6.4      labeling_0.4.3      curl_5.2.1         
[19] classInt_0.4-10     xml2_1.3.6          KernSmooth_2.23-24 
[22] abind_1.4-8         withr_3.0.0         grid_4.4.1         
[25] fansi_1.0.6         git2r_0.33.0        e1071_1.7-14       
[28] colorspace_2.1-0    future_1.33.2       globals_0.16.3     
[31] scales_1.3.0        cli_3.6.3           crayon_1.5.3       
[34] rmarkdown_2.27      generics_0.1.3      rstudioapi_0.16.0  
[37] httr_1.4.7          tzdb_0.4.0          proxy_0.4-27       
[40] DBI_1.2.3           cachem_1.1.0        pander_0.6.5       
[43] splines_4.4.1       rvest_1.0.4         parallel_4.4.1     
[46] tigris_2.1          vctrs_0.6.5         jsonlite_1.8.8     
[49] carData_3.0-5       car_3.1-2           callr_3.7.6        
[52] hms_1.1.3           rstatix_0.7.2       listenv_0.9.1      
[55] jquerylib_0.1.4     units_0.8-5         glue_1.7.0         
[58] parallelly_1.37.1   codetools_0.2-20    ps_1.7.7           
[61] stringi_1.8.4       gtable_0.3.5        later_1.3.2        
[64] broom.mixed_0.2.9.5 munsell_0.5.1       furrr_0.3.1        
[67] pillar_1.9.0        rappdirs_0.3.3      htmltools_0.5.8.1  
[70] R6_2.5.1            rprojroot_2.0.4     evaluate_0.24.0    
[73] lattice_0.22-6      highr_0.11          backports_1.5.0    
[76] broom_1.0.6         httpuv_1.6.15       bslib_0.7.0        
[79] class_7.3-22        uuid_1.2-0          Rcpp_1.0.12        
[82] gridExtra_2.3       nlme_3.1-164        whisker_0.4.1      
[85] xfun_0.45           fs_1.6.4            zoo_1.8-12         
[88] getPass_0.2-4       pkgconfig_2.0.3

Regression to predict CODIS database proportions

Hannah Van Wyk

2024-09-16 20:46:23

Part 1: regression to estimate the racial composition of each database.

Part 2: SDIS regression

Part 3: final data

References