Last updated: 2024-10-04
Checks: 6 passed, 1 warning
Knit directory: PODFRIDGE/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20230302) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 7ec169f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Unstaged changes:
Modified: analysis/relative-distribution.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/relative-distribution.Rmd) and HTML (docs/relative-distribution.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 7ec169f | linmatch | 2024-10-03 | update sibling distribution |
Rmd | 27b986f | linmatch | 2024-10-01 | update stability cohort analysis |
Rmd | 57a97db | linmatch | 2024-09-30 | Update relative-distribution.Rmd |
Rmd | 52aa7f8 | linmatch | 2024-09-27 | improving code on Part1 |
Rmd | c54a746 | linmatch | 2024-09-26 | update and fix step 2 |
Rmd | 94d00b3 | linmatch | 2024-09-24 | fix step1 |
Rmd | 1225480 | linmatch | 2024-09-23 | update on step1 |
Rmd | 83174c0 | Tina Lasisi | 2024-09-22 | Instructions and layout for relative distribution |
html | 83174c0 | Tina Lasisi | 2024-09-22 | Instructions and layout for relative distribution |
Rmd | 78c1621 | Tina Lasisi | 2024-09-21 | Update relative-distribution.Rmd |
Rmd | f4c2830 | Tina Lasisi | 2024-09-21 | Update relative-distribution.Rmd |
Rmd | 2851385 | Tina Lasisi | 2024-09-21 | rename old analysis |
html | 2851385 | Tina Lasisi | 2024-09-21 | rename old analysis |
The relative genetic surveillance of a population is influenced by the number of genetically detectable relatives individuals have. First-degree relatives (parents, siblings, and children) are especially relevant in forensic analyses using short tandem repeat (STR) loci, where close familial searches are commonly employed. To explore potential disparities in genetic detectability between African American and European American populations, we examined U.S. Census data from four census years (1960, 1970, 1980, and 1990) focusing on the number of children born to women over the age of 40.
We used publicly available data from the Integrated Public Use Microdata Series (IPUMS) for the U.S. Census years 1960, 1970, 1980, and 1990. The datasets include information on census year, sex, age, birth year, race, and the number of children ever born (CHBORN).
Data citation: Steven Ruggles, Sarah Flood, Matthew Sobek, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Renae Rogers, and Megan Schouweiler. IPUMS USA: Version 14.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D010.V14.0
Filtering Criteria: We selected women aged 40 and above to ensure that most had completed childbearing.
Due to the terms of agreement for using these data, we cannot share the full dataset, but our repo contains the subset that was used to calculate the mean number of offspring and the variance.
Race Classification: We categorized individuals into two groups: White and Black/African American.
Calculating Number of Siblings: For each child of these women, the number of siblings (n_sib) is one less than the number of children born to the mother:
\[ n_{sib} = chborn_{num} - 1 \]
path <- file.path(".", "data")
savepath <- file.path(".", "output")
prop_race_year <- file.path(path, "proportions_table_by_race_year.csv")
data_filter <- file.path(path, "data_filtered_recoded.csv")
children_data <- read.csv(prop_race_year)
mother_data <- read.csv(data_filter)
df <- mother_data %>%
# Filter for women aged 40 and above
filter(AGE >= 40) %>%
mutate(
# Create new age ranges
AGE_RANGE = case_when(
AGE >= 70 ~ "70+",
AGE >= 60 ~ "60-69",
AGE >= 50 ~ "50-59",
AGE >= 40 ~ "40-49",
TRUE ~ as.character(AGE_RANGE) # This shouldn't occur due to the filter, but included for completeness
),
# Convert CHBORN to ordered factor
CHBORN = factor(case_when(
chborn_num == 0 ~ "No children",
chborn_num == 1 ~ "1 child",
chborn_num == 2 ~ "2 children",
chborn_num == 3 ~ "3 children",
chborn_num == 4 ~ "4 children",
chborn_num == 5 ~ "5 children",
chborn_num == 6 ~ "6 children",
chborn_num == 7 ~ "7 children",
chborn_num == 8 ~ "8 children",
chborn_num == 9 ~ "9 children",
chborn_num == 10 ~ "10 children",
chborn_num == 11 ~ "11 children",
chborn_num >= 12 ~ "12+ children"
), levels = c("No children", "1 child", "2 children", "3 children",
"4 children", "5 children", "6 children", "7 children",
"8 children", "9 children", "10 children", "11 children",
"12+ children"), ordered = TRUE),
# Ensure RACE variable is correctly formatted and filtered
RACE = factor(RACE, levels = c("White", "Black/African American"))
) %>%
# Filter for African American and European American women
filter(RACE %in% c("Black/African American", "White")) %>%
# Select and reorder columns
dplyr::select(YEAR, SEX, AGE, BIRTHYR, RACE, CHBORN, AGE_RANGE, chborn_num)
# Display the first few rows of the processed data
head(df)
YEAR SEX AGE BIRTHYR RACE CHBORN AGE_RANGE chborn_num
1 1960 Female 65 1894 White No children 60-69 0
2 1960 Female 49 1911 White 2 children 40-49 2
3 1960 Female 54 1905 White No children 50-59 0
4 1960 Female 56 1903 White 1 child 50-59 1
5 1960 Female 54 1905 White 1 child 50-59 1
6 1960 Female 50 1910 White No children 50-59 0
# Summary of the processed data
summary(df)
YEAR SEX AGE BIRTHYR
Min. :1960 Length:1917477 Min. : 41.00 Min. :1859
1st Qu.:1970 Class :character 1st Qu.: 49.00 1st Qu.:1905
Median :1970 Mode :character Median : 58.00 Median :1916
Mean :1976 Mean : 59.24 Mean :1916
3rd Qu.:1980 3rd Qu.: 68.00 3rd Qu.:1926
Max. :1990 Max. :100.00 Max. :1949
RACE CHBORN AGE_RANGE
White :1740755 2 children :452594 Length:1917477
Black/African American: 176722 No children:343319 Class :character
3 children :335119 Mode :character
1 child :292001
4 children :204983
5 children :113593
(Other) :175868
chborn_num
Min. : 0.00
1st Qu.: 1.00
Median : 2.00
Mean : 2.57
3rd Qu.: 4.00
Max. :12.00
# Check the levels of the RACE factor
levels(df$RACE)
[1] "White" "Black/African American"
# Count of observations by RACE
table(df$RACE)
White Black/African American
1740755 176722
# Count of observations by AGE_RANGE
table(df$AGE_RANGE)
40-49 50-59 60-69 70+
529160 517620 436829 433868
First we visualize the general trends in the frequency of the number of children for African American and European American mothers across the Census years by age group.
# Calculate proportions within each group, ensuring proper normalization
df_proportions <- df %>%
group_by(YEAR, RACE, AGE_RANGE, chborn_num) %>%
summarise(count = n(), .groups = "drop") %>%
group_by(YEAR, RACE, AGE_RANGE) %>%
mutate(proportion = count / sum(count)) %>%
ungroup()
# Reshape data for the mirror plot
df_mirror <- df_proportions %>%
mutate(proportion = if_else(RACE == "White", -proportion, proportion))
# Create color palette
my_colors <- colorRampPalette(c("#FFB000", "#F77A2E", "#DE3A8A", "#7253FF", "#5E8BFF"))(13)
# Create the plot
ggplot(df_mirror, aes(x = chborn_num, y = proportion, fill = as.factor(chborn_num))) +
geom_col(aes(alpha = RACE)) +
geom_hline(yintercept = 0, color = "black", size = 0.5) +
facet_grid(AGE_RANGE ~ YEAR, scales = "free_y") +
coord_flip() +
scale_y_continuous(
labels = function(x) abs(x),
limits = function(x) c(-max(abs(x)), max(abs(x)))
) +
scale_x_continuous(breaks = 0:12, labels = c(0:11, "12+")) +
scale_fill_manual(values = my_colors) +
scale_alpha_manual(values = c("White" = 0.7, "Black/African American" = 1)) +
labs(
title = "Distribution of Number of Children by Census Year, Race, and Age Range",
x = "Number of Children",
y = "Proportion",
fill = "Number of Children",
caption = "White population shown on left (negative values), Black/African American on right (positive values)\nProportions normalized within each age range, race, and census year\nFootnote: The category '12+' includes families with 12 or more children."
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, hjust = 0.5),
axis.text.y = element_text(size = 8),
strip.text = element_text(size = 10),
legend.position = "none",
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
# Save the plot (optional)
# ggsave("children_distribution_mirror_plot.png", width = 20, height = 25, units = "in", dpi = 300)
# Print summary to check age ranges and normalization
print(df_proportions %>%
group_by(YEAR, RACE, AGE_RANGE) %>%
summarise(total_proportion = sum(proportion), .groups = "drop") %>%
arrange(YEAR, RACE, AGE_RANGE))
# A tibble: 32 × 4
YEAR RACE AGE_RANGE total_proportion
<int> <fct> <chr> <dbl>
1 1960 White 40-49 1
2 1960 White 50-59 1
3 1960 White 60-69 1
4 1960 White 70+ 1
5 1960 Black/African American 40-49 1
6 1960 Black/African American 50-59 1
7 1960 Black/African American 60-69 1
8 1960 Black/African American 70+ 1
9 1970 White 40-49 1
10 1970 White 50-59 1
# ℹ 22 more rows
With this visualization of the distribution of the data, we can see that there are differences between White and Black Americans.
We will now test for differences in 1) the mean and variance, and 2) zero-inflation.
Question 1) What is the best model for the distribution of the data in each of these subsets of data (by race, by census year, by age range combined)?
For each combination of race, census year, and age range, fit your candidate models:
This should be done for each subset of the data (race × census year × age range).
Understanding the Data:
Candidate Models for Count Data:
Recommended Approach:
1: Exploratory Data Analysis (EDA)
2: Model Selection
Scenario 1: Overdispersion Without Excess Zeros
Scenario 2: Overdispersion With Excess Zeros
Scenario 3: No Overdispersion, No Excess Zeros
3: Fit Models to Each Subset
Once the models are fitted, compare them using goodness-of-fit criteria like AIC or BIC for each subset. The model with the lowest AIC/BIC is the best fit for that subset.
## Approach 1:
combinations <- df %>%
group_by(YEAR, RACE, AGE_RANGE) %>%
group_split()
# Initialize the data frame for storing results
EDA_df <- data.frame(
Race = character(),
Census_Year = numeric(),
Age_Range = character(),
Overdispersion = logical(),
Zero_Inflation = logical(),
stringsAsFactors = FALSE
)
# Initialize vectors for overdispersion and zero-inflation
overdispersion <- logical(length(combinations))
zero_inflation <- logical(length(combinations))
for (i in seq_along(combinations)) {
subset_data <- combinations[[i]]
mean_chborn <- mean(subset_data$chborn_num)
var_chborn <- var(subset_data$chborn_num)
# Check for overdispersion (variance > mean)
overdispersion[i] <- var_chborn > mean_chborn
# Check for zero-inflation
prop_zero <- sum(subset_data$chborn_num == 0) / nrow(subset_data)
# Compare prop_zero to expected under Poisson
exp_poisson <- exp(-mean_chborn)
# Compare prop_zero to expected under Negative Binomial
standard_nb <- glm.nb(chborn_num ~ 1, data = subset_data)
theta_nb <- standard_nb$theta
exp_zero_nb <- (theta_nb / (theta_nb + mean_chborn))^theta_nb
zero_inflation[i] <- prop_zero > exp_poisson | prop_zero > exp_zero_nb
# Store the results in the EDA_df
EDA_df <- rbind(
EDA_df,
data.frame(
Race = unique(subset_data$RACE),
Census_Year = unique(subset_data$YEAR),
Age_Range = unique(subset_data$AGE_RANGE),
Overdispersion = overdispersion[i],
Zero_Inflation = zero_inflation[i],
stringsAsFactors = FALSE
)
)
}
EDA_df$model <- NA # Create an empty column for the model
for(i in 1:nrow(EDA_df)) {
if(EDA_df$Overdispersion[i] & EDA_df$Zero_Inflation[i]) {
EDA_df$model[i] <- "Zero-Inflated Negative Binomial"
} else if(EDA_df$Overdispersion[i] & !EDA_df$Zero_Inflation[i]) {
EDA_df$model[i] <- "Negative Binomial"
} else if(!EDA_df$Overdispersion[i] & !EDA_df$Zero_Inflation[i]) {
EDA_df$model[i] <- "Poisson"
} else if(!EDA_df$Overdispersion[i] & EDA_df$Zero_Inflation[i]){
EDA_df$model[i] <- "Zero-Inflated Poisson"
}
}
# View the final EDA_df
print(EDA_df)
## Approach 2
# Initialize the data frame to store results
best_models <- data.frame(
Race = character(),
Census_Year = numeric(),
Age_Range = character(),
AIC_poisson = numeric(),
AIC_nb = numeric(),
AIC_zip = numeric(),
AIC_zinb = numeric(),
stringsAsFactors = FALSE
)
# Loop through each subset of data
for (i in seq_along(combinations)) {
subset_data <- combinations[[i]]
# Fit Poisson model
poisson_model <- glm(chborn_num ~ 1, family = poisson, data = subset_data)
# Fit Negative Binomial model
nb_model <- glm.nb(chborn_num ~ 1, data = subset_data)
# Fit Zero-Inflated Poisson model
zip_model <- zeroinfl(chborn_num ~ 1 | 1, data = subset_data, dist = "poisson")
# Fit Zero-Inflated Negative Binomial model
zinb_model <- zeroinfl(chborn_num ~ 1 | 1, data = subset_data, dist = "negbin")
# Append the result to the best_models data frame
best_models <- rbind(
best_models,
data.frame(
Race = unique(subset_data$RACE),
Census_Year = unique(subset_data$YEAR),
Age_Range = unique(subset_data$AGE_RANGE),
AIC_poisson = AIC(poisson_model),
AIC_nb = AIC(nb_model),
AIC_zip = AIC(zip_model),
AIC_zinb = AIC(zinb_model),
stringsAsFactors = FALSE
)
)
}
# Add a Best_Model column to store the best model based on minimum AIC
best_models$Best_Model <- apply(best_models, 1, function(row) {
# Get AIC values for Poisson, NB, ZIP, and ZINB models
aic_values <- c(Poisson = as.numeric(row['AIC_poisson']),
Negative_Binomial = as.numeric(row['AIC_nb']),
Zero_Inflated_Poisson = as.numeric(row['AIC_zip']),
Zero_Inflated_NB= as.numeric(row['AIC_zinb']))
# Find the name of the model with the minimum AIC value
best_model <- names(which.min(aic_values))
return(best_model)
})
best_models <- best_models %>% dplyr::select(Race, Census_Year, Age_Range, Best_Model)
# View the updated table with the Best_Model column
print(best_models)
Create a table or data frame where you store the best model for each combination of race, census year, and age range based on AIC/BIC. For example:
Race | Census Year | Age Range | Best Model |
---|---|---|---|
Black/African Am. | 1990 | 40-49 | Negative Binomial |
White | 1990 | 40-49 | Zero-Inflated NB |
White | 1980 | 50-59 | Poisson |
… | … | … | … |
(but obviously do this as a dataframe so we can analyze it)
Once you’ve gathered the best model for each subset, you can analyze which factor (race, census year, or age range) is most influential in determining the best-fitting model.
Options for statistical analysis:
# Example: chi-squared test of association between race and best-fitting model.
# Note: the warning "Chi-squared approximation may be incorrect" usually occurs
# when some expected counts in the contingency table are small (typically < 5),
# which requires a large sample of model fits to avoid.
## Create a contingency table
#table_model_race <- table(best_models$Race, best_models$Best_Model)
# Perform chi-square test
#chisq.test(table_model_race)
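When expected counts are small, an exact test avoids the chi-squared approximation altogether. A minimal sketch, assuming a hypothetical 2x2 table of best-fitting model by race (the counts below are made up for illustration; on the real data the same table() call as in the commented example above can feed fisher.test() directly):

```r
# Hypothetical contingency table: rows = race, columns = best-fitting model
tab <- matrix(c(3, 5, 6, 2), nrow = 2,
              dimnames = list(Race  = c("Black/African American", "White"),
                              Model = c("Zero_Inflated_NB", "Zero_Inflated_Poisson")))

# Fisher's exact test of association; valid even when cell counts are < 5
result <- fisher.test(tab)
result$p.value
```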
Logistic Regression: You could fit a logistic regression model where the response variable is the best-fitting model (binary or multinomial) and the predictor variables are race, census year, and age range. This will allow you to quantify the effect of each variable on the model choice.
Example in R (assuming a binary outcome, e.g. Zero-Inflated NB vs. all other models):
# Recode Best_Model to a binary variable (e.g., "Zero_Inflated_NB" vs. "Other")
best_models$Best_Model_Binary <- ifelse(best_models$Best_Model == "Zero_Inflated_NB", 1, 0)
# Fit binary logistic regression
model_logistic <- glm(Best_Model_Binary ~ Race + Census_Year + Age_Range, family = binomial(), data = best_models)
# Summary of the logistic regression model
summary(model_logistic)
Multinomial Logistic Regression (if more than two models): If you have multiple possible best-fitting models (Poisson, NB, ZIP, ZINB), you can use multinomial logistic regression to assess the impact of race, census year, and age range on model selection.
Example in R using the nnet package:
## Not appropriate since only 2 levels in best_model
# Fit multinomial logistic regression
#model_multinom <- multinom(Best_Model ~ Race + Census_Year + Age_Range, data = best_model_data)
# Summary of results
#summary(model_multinom)
Finally, visualize the distribution of the best-fitting models across races, census years, and age groups.
Example of a simple bar plot in R:
# Bar plot showing the distribution of best models across races
ggplot(best_models, aes(x = Race, fill = Best_Model)) +
geom_bar(position = "dodge") +
labs(title = "Best-Fitting Models by Race", x = "Race", y = "Count", fill = "Best Model")
# Bar plot showing the distribution of best models across census years
ggplot(best_models, aes(x = Census_Year, fill = Best_Model)) +
geom_bar(position = "dodge") +
labs(title = "Best-Fitting Models by Census Year", x = "Census Year", y = "Count", fill = "Best Model")
# Bar plot showing the distribution of best models across age ranges
ggplot(best_models, aes(x = Age_Range, fill = Best_Model)) +
geom_bar(position = "dodge") +
labs(title = "Best-Fitting Models by Age Range", x = "Age Range", y = "Count", fill = "Best Model")
ggplot(best_models, aes(x = Census_Year, y = Age_Range, fill = Best_Model)) +
geom_tile(color = "white", lwd = 0.5, linetype = 1) + # Create the tiles
facet_wrap(~Race, nrow = 2) + # Facet by Race
scale_fill_manual(values = c("Zero_Inflated_Poisson" = "#0072B2", "Zero_Inflated_NB" = "#F0E442")) +
labs(
title = "Best Model by Race, Census Year, and Age Range",
x = "Census Year",
y = "Age Range",
fill = "Best Model"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels for readability
plot.title = element_text(size = 14, face = "bold"),
legend.position = "bottom"
)
For Question 2: Cohort Stability Analysis, the goal is to determine if there is a significant change in zero inflation, family size, or model fit for the same cohort across different census years, given race.
Using the existing table of model summaries, calculate the birth-year cohort based on the Census Year and Age Range. For example, if a woman is in the 40-49 age range in the 1990 Census, her birth cohort would be 1941-1950.
Store this in a new Cohort column. Example Table:
Race | Census Year | Age Range | Best Model | Cohort |
---|---|---|---|---|
Black/African Am. | 1990 | 40-49 | Negative Binomial | 1941-1950 |
White | 1990 | 40-49 | Zero-Inflated NB | 1941-1950 |
White | 1980 | 50-59 | Poisson | 1921-1930 |
… | … | … | … | … |
calculate_cohort <- function(census_year, age_range) {
  # Extract the lower (and, if present, upper) age limits from the Age_Range
  age_limits <- as.numeric(unlist(strsplit(age_range, "-|\\+")))
  lower_age <- age_limits[1]
  # Open-ended ranges such as "70+" have no upper age limit, so the cohort has
  # no lower bound; label it "pre-<year>" (e.g. "pre-1920" for 70+ in 1990)
  if (length(age_limits) == 1) {
    return(paste0("pre-", census_year - lower_age))
  }
  upper_age <- age_limits[2]
  # Oldest members were born census_year - upper_age,
  # youngest were born census_year - lower_age
  lower_cohort <- census_year - upper_age
  upper_cohort <- census_year - lower_age
  # Return the birth cohort as a string (e.g., "1941-1950")
  paste(lower_cohort, upper_cohort, sep = "-")
}
# Apply this function to your best_models table to create a new 'Cohort' column
best_models <- best_models %>%
mutate(Cohort = mapply(calculate_cohort, Census_Year, Age_Range)) %>%
dplyr::select(Race, Census_Year, Age_Range, Best_Model, Cohort)
# View the updated table
print(best_models)
For each combination of Race, Cohort, and Census Year, compute additional summary statistics for the family size distributions:
- Mean, variance, and mode of the number of children.
- Probability of having 0, 1, 2, …, 12+ children (empirical or from the best-fitting model).
These statistics will help quantify family size changes over time.
Example Summary Table:
Race | Cohort | Census Year | Best Model | Mean | Variance | Mode | Prob(0) | Prob(1) | Prob(2) | … | Prob(12+) |
---|---|---|---|---|---|---|---|---|---|---|---|
Black/African Am. | 1941-1950 | 1990 | Negative Binomial | 2.5 | 1.2 | 2 | 0.20 | 0.30 | 0.25 | … | 0.01 |
White | 1941-1950 | 1990 | Zero-Inflated NB | 2.1 | 1.0 | 2 | 0.35 | 0.28 | 0.20 | … | 0.01 |
White | 1921-1930 | 1980 | Poisson | 3.0 | 1.5 | 3 | 0.15 | 0.25 | 0.30 | … | 0.05 |
… | … | … | … | … | … | … | … | … | … | … | … |
df_merg <- df %>%
rename(Census_Year = YEAR, Race = RACE, Age_Range = AGE_RANGE)
# Perform a left join to merge the data frames
merged_data <- left_join(df_merg, best_models, by = c("Race", "Census_Year", "Age_Range"))
calculate_mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
summary_table <- merged_data %>% group_by(Race, Cohort, Census_Year) %>%
summarise(
Mean = mean(chborn_num),
Variance = var(chborn_num),
Mode = calculate_mode(chborn_num),
Prob_0 = mean(chborn_num == 0),
Prob_1 = mean(chborn_num == 1),
Prob_2 = mean(chborn_num == 2),
Prob_3 = mean(chborn_num == 3),
Prob_4 = mean(chborn_num == 4),
Prob_5 = mean(chborn_num == 5),
Prob_6 = mean(chborn_num == 6),
Prob_7 = mean(chborn_num == 7),
Prob_8 = mean(chborn_num == 8),
Prob_9 = mean(chborn_num == 9),
Prob_10 = mean(chborn_num == 10),
Prob_11 = mean(chborn_num == 11),
Prob_12plus = mean(chborn_num >= 12)
#Total_Prob = Prob_0 + Prob_1 + Prob_2 + Prob_3 + Prob_4 + Prob_5 + Prob_6 +
# Prob_7 + Prob_8 + Prob_9 + Prob_10 + Prob_11 + Prob_12plus
) %>%
ungroup() %>% as.data.frame()
# View the summary table
print(summary_table)
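The chunk above computes the empirical probabilities. For the model-based alternative mentioned earlier, the probabilities can be read off the fitted distribution; a base-R sketch assuming a negative binomial fit (the mu and theta values here are hypothetical placeholders, in practice taken from a glm.nb fit):

```r
# Hypothetical negative binomial parameters (in practice from a fitted model)
mu    <- 2.5   # fitted mean number of children
theta <- 1.8   # fitted dispersion (size) parameter

# P(0), P(1), ..., P(11), plus the lumped category P(12+)
p_k      <- dnbinom(0:11, size = theta, mu = mu)
p_12plus <- 1 - sum(p_k)

round(c(p_k, "12+" = p_12plus), 3)
```

By construction these probabilities sum to 1, so they are directly comparable to the empirical Prob_0 … Prob_12plus columns.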
Once the cohort information and summary statistics are available, the goal is to determine if there is a significant change in any of the following across census years for the same cohort:
Use a logistic regression model to assess if the probability of having zero children changes significantly over time for the same cohort.
Example Code:
# Fit logistic regression to check if zero-inflation changes over time
# (Prob_0 is a proportion, not a 0/1 outcome; supply the group sizes as
# weights, or use quasibinomial() to avoid non-integer-successes warnings)
zero_inflation_model <- glm(Prob_0 ~ Census_Year + Race + Cohort,
family = quasibinomial(), data = summary_table)
# Summary of the model
summary(zero_inflation_model)
Result: The p-value for Census_Year is greater than 0.05, indicating no statistically significant effect. Therefore, the probability of having zero children does not change significantly over time for the same cohort.
Use an ANOVA or linear regression to test if the mean or variance in family size changes over census years for the same cohort.
Example Code:
# Fit linear model to check for changes in family size over time
family_size_model <- lm(Mean ~ Census_Year + Race + Cohort, data = summary_table)
# ANOVA to check for significant differences in mean
anova(family_size_model)
Result: The p-value for Census_Year is greater than 0.05, indicating no statistically significant effect. Therefore, the mean family size does not change significantly across census years for the same cohort.
If the best-fitting model changes for the same cohort over time, this suggests that the distribution of family sizes has shifted. Use logistic regression (binary here, since only two candidate models appear in our results; multinomial if there are more) to test whether the best-fitting model differs across census years for the same cohort.
Example Code:
cohort_data <- summary_table %>%
left_join(best_models, by = c("Race", "Census_Year", "Cohort")) %>%
mutate(Best_Model_Binary = ifelse(Best_Model == "Zero_Inflated_NB", 1, 0))
# Fit logistic regression to test changes in the best-fitting model
# (only two model categories occur here, so binary rather than multinomial)
model_fit_analysis <- glm(Best_Model_Binary ~ Census_Year + Race + Cohort,
family = binomial(), data = cohort_data)
# Summary of the model
summary(model_fit_analysis)
Result: The p-value for Census_Year is greater than 0.05, indicating no statistically significant effect of census year on which model fits best.
For each cohort, you will assess whether:
- Zero-inflation changes significantly over time (i.e., whether the probability of having no children decreases or increases across census years).
- Family size (mean, variance, mode) changes significantly over time.
- The best-fitting model changes over time, indicating a shift in family size distributions.
If significant changes are detected for a cohort (e.g., the same cohort switches from a zero-inflated model to a non-zero-inflated model, or family sizes decrease), this indicates a shift in the demographic pattern for that group.
This analysis will allow you to evaluate cohort stability by testing whether demographic patterns (e.g., zero-inflation, family size) are consistent over time for the same cohort or if there are notable shifts. If the patterns change significantly, you will be able to identify when and how the demographic trends for each cohort begin to deviate from their earlier distribution.
For Question 3: Significant Fertility Shifts, the goal is to summarize the information from the first two questions and create a comprehensive analysis and visualization that illustrates significant fertility shifts in cohorts and compares the fertility patterns of 40-49 year-olds to 50-59 year-olds in the 1990 census. This lets us pick the set of fertility distributions we want to use to visualize the sibling distribution and to do the math on genetic surveillance.
Create a comprehensive analysis and visualization that:
The steps below are suggestions: use critical thinking and your judgement to complete this task and answer the question
Work with the number of children (chborn_num), age range (AGE_RANGE), and race (RACE) variables. Design visualizations that effectively communicate your findings.
Conduct statistical tests to determine if observed differences are significant.
Write a clear and concise summary addressing the following points.
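The statistical-test step above can be sketched as follows. The data here are simulated stand-ins for the 1990 subset of df so the snippet runs on its own, and the choice of a Wilcoxon rank-sum test is a suggestion, not prescribed by the text:

```r
# Simulated stand-in for the 1990 subset of df (two age ranges)
set.seed(1)
df_1990 <- data.frame(
  AGE_RANGE  = rep(c("40-49", "50-59"), each = 200),
  chborn_num = c(rpois(200, 2.2), rpois(200, 3.0))
)

# Rank-sum test: do the two age ranges differ in children ever born?
test <- wilcox.test(chborn_num ~ AGE_RANGE, data = df_1990)
test$p.value
```

On the real data this comparison would be run within each race, e.g. inside a loop over levels(df$RACE).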
Having analyzed the distribution of the number of children, we now turn our attention to the distribution of the number of siblings. We will explore the trends in the frequency of the number of siblings for African American and European American mothers across the Census years by age group.
Frequency of siblings is calculated as follows.
\[ \text{freq}_{n_{\text{sib}}} = \text{freq}_{\text{mother}} \cdot \text{chborn}_{\text{num}} \]
For example, suppose 10 mothers (generation 0) have 7 children, then there will be 70 children (generation 1) in total who each have 6 siblings.
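This worked example can be checked directly (the numbers are the ones from the text):

```r
n_mothers <- 10                   # generation-0 mothers
chborn    <- 7                    # children born to each mother

n_children <- n_mothers * chborn  # size of generation 1: 70 children
n_sib_each <- chborn - 1          # siblings per child: 6

c(children = n_children, siblings_each = n_sib_each)
```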
We take our original data and calculate the frequency of siblings for each mother based on the number of children they have. We then aggregate this data to get the frequency of siblings for each generation along with details on the birthyears of the relevant children to visualize the distribution of the number of siblings across generations.
Create a mirror plot for the distribution of siblings across census years, races, and birth ranges, similar to the one created for the number of children.
First, create a birth_range column using calculate_cohort:
df2 <- df %>%
mutate(birth_range = mapply(calculate_cohort, YEAR, AGE_RANGE)) %>%
dplyr::select(RACE, YEAR, AGE_RANGE, birth_range, chborn_num)
df2 <- df2 %>%
# Childless mothers contribute no children to the sibling distribution
filter(chborn_num >= 1) %>%
mutate(n_siblings = chborn_num - 1)
df2 <- df2 %>%
# Each mother (frequency 1) contributes chborn_num children, each of whom has
# n_siblings siblings, so the sibling frequency is weighted by chborn_num
mutate(sibling_freq = chborn_num)
df_siblings <- df2 %>%
group_by(YEAR, RACE, birth_range, n_siblings) %>%
summarise(
sibling_count = sum(sibling_freq),
.groups = "drop"
)
df_sibling_proportions <- df_siblings %>%
group_by(YEAR, RACE, birth_range) %>%
mutate(proportion = sibling_count / sum(sibling_count)) %>%
ungroup()
Before creating the plot, verify that the proportions are correctly normalized:
normalization_check <- df_sibling_proportions %>%
group_by(YEAR, RACE, birth_range) %>%
summarise(total_proportion = sum(proportion), .groups = "drop") %>%
arrange(YEAR, RACE, birth_range)
print(normalization_check)
Ensure that the total_proportion for each group is very close to 1.0. If not, revisit your calculations in Step 2.
Reshape data for the mirror plot:
df_sibling_mirror <- df_sibling_proportions %>%
mutate(proportion = if_else(RACE == "White", -proportion, proportion))
my_colors <- colorRampPalette(c("#FFB000", "#F77A2E", "#DE3A8A", "#7253FF", "#5E8BFF"))(13)
ggplot(df_sibling_mirror, aes(x = n_siblings, y = proportion, fill = as.factor(n_siblings))) +
geom_col(aes(alpha = RACE)) +
geom_hline(yintercept = 0, color = "black", size = 0.5) +
facet_grid(birth_range ~ YEAR, scales = "free_y") +
coord_flip() +
scale_y_continuous(
labels = function(x) abs(x),
limits = function(x) c(-max(abs(x)), max(abs(x)))
) +
scale_x_continuous(breaks = 0:12, labels = c(0:11, "12+")) +
scale_fill_manual(values = my_colors) +
scale_alpha_manual(values = c("White" = 0.7, "Black/African American" = 1)) +
labs(
title = "Distribution of Number of Siblings by Census Year, Race, and Birth Range",
x = "Number of Siblings",
y = "Proportion",
fill = "Number of Siblings",
caption = "White population shown on left (negative values), Black/African American on right (positive values)\nProportions normalized within each birth range, race, and census year\nFootnote: The category '12+' includes individuals with 12 or more siblings."
) +
theme_minimal() +
theme(
plot.title = element_text(size = 14, hjust = 0.5),
axis.text.y = element_text(size = 8),
strip.text = element_text(size = 10),
legend.position = "none",
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
After creating the plot, examine it carefully:
Look for patterns in sibling distribution across different census years and birth ranges. Result:
- By census year: we observe a decrease in family size over time, with distributions centered on smaller numbers of siblings. By 1990 the distributions are more compact, with a peak around 2-4 siblings, and the tail (large numbers of siblings) has shrunk considerably.
- By birth range: for individuals born in earlier periods (pre-1900 or the 1910s), the distribution is skewed toward larger families, with more people having many siblings; as we move to more recent birth ranges, the distributions shift toward smaller family sizes.
Compare the distributions between White and Black/African American populations. Result: In general, the White population tends to have a more concentrated distribution around 2–4 siblings in later census years, while the Black/African American population often shows a slightly more spread-out distribution. However, the trend of decreasing family size over time is visible in both populations, though it appears more pronounced for the White population.
Note any significant changes or trends in the number of siblings over time. Result:
- Decrease in family size: both the children and sibling distributions indicate a clear societal trend toward smaller families over time, with fewer people having a large number of siblings or children.
- Cohort effects: individuals born earlier (pre-1900 or the early 1900s) tend to come from larger families, while those born later (1930s-1950s) tend to have fewer siblings and children.
Consider how these sibling distributions might differ from the children distributions you analyzed earlier, and think about potential reasons for these differences. Result: While both distributions show a trend toward smaller families, the sibling distribution is more spread out across different sibling counts. This reflects the size-biased weighting in the sibling calculation: a family with 12 children contributes 12 individuals who each have 11 siblings, so large families carry proportionally more weight in the sibling distribution than in the children distribution.
Repeat the model fitting process we performed for the children distribution, this time using the sibling distribution data.
Prepare the Sibling Data: use the sibling distribution data you calculated earlier (df_sibling_proportions).
Fit Models For each combination of RACE, YEAR, and birth_range, fit the following distributions:
Use the same R functions and packages as in the children distribution analysis.
Compare Model Fits Use AIC (Akaike Information Criterion) to compare the fits of different models for each group.
Identify Best-Fitting Models Determine which model fits best for each combination of RACE, YEAR, and birth_range.
Analyze Patterns
Visualize Results Create a summary plot showing the best-fitting models across years and birth ranges for each race, similar to the one created for the children distribution.
Interpret Findings
Analyze the stability of sibling distributions across cohorts, similar to the analysis performed for children.
Use the sibling distribution data (df_sibling_proportions).
For each combination of RACE and birth_range:
Create a summary table showing:
Visualize stability:
Interpret results:
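A minimal base-R sketch of such a stability summary (the data frame here is a simulated stand-in for the sibling-level data, with hypothetical RACE, birth_range, YEAR, and n_siblings values):

```r
set.seed(42)
groups <- expand.grid(
  RACE        = c("White", "Black/African American"),
  birth_range = c("1911-1920", "1921-1930"),
  YEAR        = c(1960, 1970, 1980)
)
# 50 hypothetical sibling counts per RACE x birth_range x YEAR cell
sib <- groups[rep(seq_len(nrow(groups)), each = 50), ]
sib$n_siblings <- rpois(nrow(sib), lambda = 3)

# Mean number of siblings per cell: a stable cohort should show similar
# means for the same RACE x birth_range across census years
stability <- aggregate(n_siblings ~ RACE + birth_range + YEAR, data = sib, FUN = mean)
stability
```

On the real data the same summary would be computed from df2, and variance or zero-proportion columns can be added with further aggregate() calls.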
Create a visualization showing how sibling distributions change across census years for different races and birth ranges.
Use the sibling distribution data (df_sibling_proportions).
Create a multi-panel plot:
Use ggplot2 to create the plot:
ggplot(df_sibling_proportions, aes(x = n_siblings, y = proportion, color = factor(YEAR))) +
geom_line() +
facet_grid(RACE ~ birth_range) +
labs(title = "Sibling Distribution Across Census Years",
x = "Number of Siblings", y = "Proportion", color = "Census Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Remember to reference your findings from the children distribution analysis when discussing the results, highlighting any notable similarities or differences between the two analyses.
sessionInfo()