Last updated: 2018-09-16
workflowr checks: (Click a bullet for more information)wflow_publish
to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
set.seed(20180914)
The command set.seed(20180914)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: code/.DS_Store
Ignored: data/.DS_Store
Ignored: data/derived/.DS_Store
Ignored: data/derived/output/.DS_Store
Ignored: data/input/.DS_Store
Ignored: figures/.DS_Store
Untracked files:
Untracked: code/Drosophila_GWAS.Rmd
Untracked: data/derived/output/male_late_lm,.assoc.txt
Untracked: data/derived/output/male_late_lm,.log.txt
Untracked: data/derived/trimmed_DGRP.bk
Unstaged changes:
Modified: analysis/get_predicted_line_means.Rmd
Modified: analysis/index.Rmd
Modified: analysis/perform_gwas.Rmd
Modified: analysis/plot_line_means.Rmd
Modified: data/derived/output/DGRP_GRM.log.txt
Modified: data/derived/output/female_early_bslmm.bv.txt
Modified: data/derived/output/female_early_bslmm.gamma.txt
Modified: data/derived/output/female_early_bslmm.hyp.txt
Modified: data/derived/output/female_early_bslmm.log.txt
Modified: data/derived/output/female_early_bslmm.param.txt
Modified: data/derived/output/female_early_female_late.log.txt
Modified: data/derived/output/female_early_lm.log.txt
Modified: data/derived/output/female_early_lmm.log.txt
Modified: data/derived/output/female_early_male_early.log.txt
Modified: data/derived/output/female_late_lm.log.txt
Modified: data/derived/output/female_late_lmm.log.txt
Modified: data/derived/output/female_late_male_late.log.txt
Modified: data/derived/output/male_early_lm.log.txt
Modified: data/derived/output/male_early_lmm.log.txt
Modified: data/derived/output/male_early_male_late.log.txt
Modified: data/derived/output/male_late_lmm.log.txt
Modified: figures/figure1.eps
Modified: figures/figure2.eps
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes. File | Version | Author | Date | Message |
---|---|---|---|---|
html | 449a929 | Luke Holman | 2018-09-14 | Build site. |
html | e981b22 | Luke Holman | 2018-09-14 | Made GWAS page |
Rmd | 64799c1 | Luke Holman | 2018-09-14 | Made get_predicted_line_means |
html | 64799c1 | Luke Holman | 2018-09-14 | Made get_predicted_line_means |
library(dplyr)
library(brms)
library(pander)
Most studies of the DGRP calculate so-called “line means” by averaging the phenotypes of several individuals per line. Instead, we used Bayesian generalised linear mixed model (GLMMs) to estimate the true line means for each of the 4 fitness traits.
One notable advantage of using a model instead of a simple average, is that this allows one to statistically remove confounding factors such as block effects, which would otherwise add noise to the line means. Note that we intentionally measured line 352 in every block, in order to have good precision to estimate the block effect.
Additionally, we must properly account for our experimental design in order to avoid pseudoreplication. Our design involved repeatedly measuring the same individuals to determine early- and late-life fitness.
For the progeny count data used to measure female fitness, we used a Poisson GLMM with the following formula:
Y=age+(age|line)+(age|block)+(1|vial.ID)
That is, we specify a random intercept for each vial, block and line. Additionally, we specify a random slope with age for each line and block, allowing for differential change in fitness with age between lines and blocks. The then use the fitted
function in the brms
package to derive the estimated mean progeny production for each combination of age and line. The line means were left on the scale of the linear predictor (as opposed to back-transforming them via the link function), in order to increase their normality. Furthermore, they were scaled to have a mean of zero and standard deviation of one.
Similarly, for our measure of male fitness (proportion of offspring sired), we used a binomial model with the following formula:
Y=age+(age|line)+(age|block)+(1|vial.ID)
Where Y is a 1 or 0 depending on whether the focal offspring was sired by a male from the focal DGRP line or a rival male, respectively.
load_and_tidy_fitness_data <- function(){
female_fitness <- read.csv("data/input/female_fitness.csv", stringsAsFactors = FALSE) %>%
arrange(line) %>%
mutate(vial.ID = paste("F", block, replicate, line, sep = "_"),
line = paste("line_", line, sep = "")) %>%
select(-num.F.late1, -num.F.late2, -num.laying.F.early, -replicate)
female_fitness$female.fitness.early[is.na(female_fitness$female.fitness.early)] <- 0
female_fitness$female.fitness.late[is.na(female_fitness$female.fitness.late)] <- 0
male_fitness <- read.csv("data/input/male_fitness.csv", stringsAsFactors = FALSE ) %>%
arrange(line) %>%
mutate(vial.ID = paste("M", block, replicate, line, sep = "_"),
line = paste("line_", line, sep = "")) %>%
select(-num.DGRP.males, -replicate)
# Save these tidy files, for data archiving
write.csv(female_fitness, file = "data/input/female_fitness_CLEANED.csv", row.names = FALSE)
write.csv(male_fitness, file = "data/input/male_fitness_CLEANED.csv", row.names = FALSE)
list(female_fitness = female_fitness, male_fitness = male_fitness)
}
# This function calculates the means for each line
get.line.means <- function(){
left_join(
load_and_tidy_fitness_data()[[1]] %>% select(line, female.fitness.early, female.fitness.late) %>%
group_by(line) %>%
summarise(female.fitness.early = mean(female.fitness.early),
female.fitness.late = mean(female.fitness.late)),
load_and_tidy_fitness_data()[[2]] %>%
group_by(line) %>%
summarise(early.male.rival = sum(early.male.rival),
early.male.focal = sum(early.male.focal),
late.male.rival = sum(late.male.rival),
late.male.focal = sum(late.male.focal),
male.fitness.early = early.male.focal / (early.male.focal + early.male.rival),
male.fitness.late = late.male.focal / (late.male.focal + late.male.rival)) %>%
select(line, male.fitness.early, male.fitness.late),
by = "line") %>% as.data.frame()
}
# A function to calculate each line's expected fitness, using brms models to adjust for block + vial random effects
# For the male late life assay, we also adjust for the number of competitor males that died
# (this assumes that their deaths happened randomly with respect to which line we were measuring)
get_predicted_line_means <- function(){
# If it's already calculated, just load up the file. Otherwise, make the file
out.file <- "data/derived/predicted_line_means.csv"
if(file.exists(out.file)) return(read.csv(out.file, stringsAsFactors = FALSE))
####### Load the individual-level data
# Load the female data and rearrange to long format
female_fitness <- load_and_tidy_fitness_data()[[1]] %>%
gather(age, fitness, female.fitness.early, female.fitness.late)
# Load the male data and make 5 imputed datasets using the package 'mice'
# (there are 2 vials with no early-male fitness, and 1 vial with no count of the number of GFP males)
# See this link for explanation: https://cran.r-project.org/web/packages/brms/vignettes/brms_missings.html
male_fitness <- load_and_tidy_fitness_data()[[2]]
male_fitness_imputed <- mice(male_fitness, m = 5, print = FALSE, seed = 1)
male_fitness_imputed <- lapply(1:5, function(i) complete(male_fitness_imputed, i))
male_fitness_imputed <- lapply(male_fitness_imputed, function(focal) {
focal %>% mutate(total1 = early.male.rival + early.male.focal,
total2 = late.male.rival + late.male.focal)})
####### Run model on female data
female_model <- brm(fitness ~ age + (age|line) + (age|block) + (1|vial.ID),
family = poisson, data = female_fitness,
iter = 4000, chains = 4, cores = 4,
seed = 1, control = list(adapt_delta = 0.999, max_treedepth = 15)) %>%
add_ic(ic = "R2")
writeLines(capture.output(summary(female_model)), con = "data/derived/model_summary_females.txt")
saveRDS(female_model, "data/derived/female_fitness_prediction_model.rds") # maybe delete later - probably not needed?
### Run model on male data
# Fit a non-linear smooth for num.GFP.males, but only for the late life data
# Inspection of the raw data and model results illustrates the non-linear smooth improves model fit appreciably
# For example, the 'wiggliness' of the smooth term is significantly non-zero
# See: https://www.fromthebottomoftheheap.net/2018/04/21/fitting-gams-with-brms/
male1 <- bf(early.male.focal | trials(total1) ~ (1|line) + (1|block) + (1|vial.ID))
male2 <- bf(late.male.focal | trials(total2) ~ s(num.GFP.males) + (1|line) + (1|block) + (1|vial.ID))
# Run 5 models, on each of the 5 imputed datasets, and combine the results
male_model <- brm_multiple(male1 + male2,
family = binomial, data = male_fitness_imputed,
iter = 4000, chains = 4, cores = 4,
seed = 1, control = list(adapt_delta = 0.999, max_treedepth = 15)) %>%
add_ic(ic = "R2")
writeLines(capture.output(summary(male_model)), con = "data/derived/model_summary_males.txt")
saveRDS(male_model, "data/derived/male_fitness_prediction_model.rds") # maybe delete later - probably not needed?
male_model <- readRDS("data/derived/male_fitness_prediction_model.rds")
####### Define new data for prediction
new.data.female <- female_fitness %>%
select(age, line) %>% distinct() %>% arrange(line, age)
new.data.male <- male_fitness %>%
select(line) %>% distinct() %>% arrange(line) %>%
mutate(total1 = 100, # I checked, and changing this number does not affect the line means, as I expected
total2 = 100,
num.GFP.males = male_fitness %>% .$num.GFP.males %>% mean(na.rm=TRUE))
# Get predicted lines means for early and late life female fitness on linear predictor scale (i.e. normal)
female_line_preds <- data.frame(new.data.female,
fitted(female_model,
newdata = new.data.female,
re_formula = "~ (age|line)",
scale = "linear")) %>%
select(age, line, Estimate) %>%
spread(age, Estimate) %>%
mutate(female.fitness.early = as.numeric(scale(female.fitness.early)),
female.fitness.late = as.numeric(scale(female.fitness.late)))
# Get predicted lines means for early and late life male fitness on linear predictor scale (i.e. normal)
male_line_preds <- data.frame(new.data.male,
fitted(male_model_NEW,
newdata = new.data.male,
re_formula = "~ (1|line)",
scale = "linear")) %>%
select(line, Estimate.earlymalefocal, Estimate.latemalefocal) %>%
mutate(male.fitness.early = as.numeric(scale(Estimate.earlymalefocal)),
male.fitness.late = as.numeric(scale(Estimate.latemalefocal))) %>%
select(line, male.fitness.early, male.fitness.late)
# Calculate the proportion of variance explained by line for females
## Get the posterior predicted values for each data point
predicted <- data.frame(female_fitness[complete.cases(female_fitness), ],
t(fitted(female_model, summary = FALSE)))
## Define a function to get the ICC for line, for each posterior sample of predictions
get_ICC_line <- function(predictions, age_class = NULL){
if(!is.null(age_class)) predictions <- predictions %>% filter(age == age_class)
first.col <- which(str_detect(names(predictions), "X"))[1]
sapply(0:7999, function(i){
ICCbare(predictions$line, predictions[, i+first.col])
})
}
# Calculate ICC for each sample, separately for old and young females
Line_ICC_female_early <- get_ICC_line(predicted, "female.fitness.early")
Line_ICC_female_late <- get_ICC_line(predicted, "female.fitness.late")
# Do the same for males
male_preds_early <- data.frame(line = male_fitness$line,
t(fitted(male_model, summary = FALSE)[,,1]))
male_preds_late <- data.frame(line = male_fitness$line,
t(fitted(male_model, summary = FALSE)[,,1]))
Line_ICC_male_early <- get_ICC_line(male_preds_early)
Line_ICC_male_late <- get_ICC_line(male_preds_late)
ICC <- posterior_summary(as.mcmc(data.frame(Line_ICC_female_early,
Line_ICC_female_late,
Line_ICC_male_early,
Line_ICC_male_late)))
write.csv(ICC, file = "data/derived/ICC.csv", row.names = TRUE)
# Record which block(s) each line was measured in. Generally one block, except for ref line (all blocks)
line_blocks <- rbind(female_fitness %>% select(block, line),
male_fitness %>% select(block, line)) %>%
mutate(block = replace(block, line == "352", "reference_line")) %>%
distinct() %>%
group_by(line) %>% summarise(block = paste0(block, collapse = ", "))
predicted_line_means <- female_line_preds %>%
left_join(male_line_preds, by = "line") %>%
left_join(line_blocks, by = "line") %>%
arrange(line) %>%
mutate(line = paste("line_", line, sep = ""))
write.csv(predicted_line_means, out.file, row.names = FALSE)
predicted_line_means %>% mutate(line = gsub("line_line_", "line_", line))
}
The object line_means
is the raw line means: unscaled, arithmetic means and in the original units.
The object predicted_line_means
is the estimated line means from the Bayesian models. These means are on the scale of the linear predictor (i.e. Poisson for females, binomial for males) rather than on the scale of the original response. They have been scaled to have a mean of 0 and a variance of 1.
line_means <- get.line.means()
predicted_line_means <- get_predicted_line_means()
cat(readLines('data/derived/model_summary_females.txt'), sep = '\n')
Family: poisson
Links: mu = log
Formula: fitness ~ age + (age | line) + (age | block) + (1 | vial.ID)
Data: female_fitness (Number of observations: 1500)
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
total post-warmup samples = 8000
ICs: LOO = NA; WAIC = NA; R2 = 0.94
Group-Level Effects:
~block (Number of levels: 9)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 0.32 0.12 0.16 0.62 5698 1.00
sd(agefemale.fitness.late) 1.29 0.43 0.74 2.35 4667 1.00
cor(Intercept,agefemale.fitness.late) -0.41 0.32 -0.88 0.33 8000 1.00
~line (Number of levels: 115)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 0.54 0.04 0.46 0.63 4294 1.00
sd(agefemale.fitness.late) 1.58 0.13 1.34 1.87 4293 1.00
cor(Intercept,agefemale.fitness.late) 0.41 0.10 0.21 0.58 3559 1.00
~vial.ID (Number of levels: 750)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept) 0.48 0.02 0.45 0.51 3005 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 4.40 0.13 4.15 4.65 8000 1.00
agefemale.fitness.late -2.96 0.49 -3.96 -2.00 8000 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
cat(readLines('data/derived/model_summary_males.txt'), sep = '\n')
Family: MV(binomial, binomial)
Links: mu = logit
mu = logit
Formula: early.male.focal | trials(total1) ~ (1 | line) + (1 | block) + (1 | vial.ID)
late.male.focal | trials(total2) ~ s(num.GFP.males) + (1 | line) + (1 | block) + (1 | vial.ID)
Data: male_fitness_imputed (Number of observations: 750)
Samples: 20 chains, each with iter = 4000; warmup = 2000; thin = 5;
total post-warmup samples = 8000
ICs: LOO = NA; WAIC = NA; R2 = 0.98
Smooth Terms:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sds(latemalefocal_snum.GFP.males_1) 2.80 2.11 0.27 8.30 7683 1.00
Group-Level Effects:
~block (Number of levels: 9)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(earlymalefocal_Intercept) 0.73 0.24 0.41 1.34 7026 1.00
sd(latemalefocal_Intercept) 1.00 0.38 0.48 1.94 7293 1.00
~line (Number of levels: 115)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(earlymalefocal_Intercept) 0.57 0.06 0.47 0.68 7060 1.00
sd(latemalefocal_Intercept) 1.37 0.15 1.09 1.67 7202 1.00
~vial.ID (Number of levels: 750)
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(earlymalefocal_Intercept) 0.81 0.03 0.76 0.87 5630 1.01
sd(latemalefocal_Intercept) 1.91 0.08 1.76 2.08 6832 1.00
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
earlymalefocal_Intercept 0.30 0.27 -0.25 0.81 7744 1.00
latemalefocal_Intercept -1.17 0.39 -1.94 -0.37 7624 1.00
latemalefocal_snum.GFP.males_1 0.08 0.60 -1.08 1.43 7941 1.00
Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
is a crude measure of effective sample size, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
read.csv('data/derived/ICC.csv') %>%
rename(Parameter = X, Lower.95.CI = X2.5.ile, Upper.95.CI = X97.5.ile) %>% pander()
Parameter | Estimate | Est.Error | Lower.95.CI | Upper.95.CI |
---|---|---|---|---|
Line_ICC_female_early | 0.6044 | 0.007116 | 0.5903 | 0.6183 |
Line_ICC_female_late | 0.8514 | 0.006105 | 0.8392 | 0.863 |
Line_ICC_male_early | 0.5089 | 0.004727 | 0.4996 | 0.5182 |
Line_ICC_male_late | 0.5089 | 0.004727 | 0.4996 | 0.5182 |
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 pander_0.6.2 brms_2.4.0 ggplot2_3.0.0
[5] Rcpp_0.12.18 dplyr_0.7.6
loaded via a namespace (and not attached):
[1] Brobdingnag_1.2-5 R.utils_2.7.0 gtools_3.8.1
[4] StanHeaders_2.17.2 threejs_0.3.1 shiny_1.1.0
[7] assertthat_0.2.0 stats4_3.5.1 yaml_2.2.0
[10] pillar_1.3.0 backports_1.1.2 lattice_0.20-35
[13] glue_1.3.0 digest_0.6.15 promises_1.0.1
[16] colorspace_1.3-2 htmltools_0.3.6 httpuv_1.4.5
[19] Matrix_1.2-14 R.oo_1.22.0 plyr_1.8.4
[22] dygraphs_1.1.1.6 pkgconfig_2.0.1 rstan_2.17.3
[25] purrr_0.2.5 xtable_1.8-2 mvtnorm_1.0-8
[28] scales_1.0.0 whisker_0.3-2 later_0.7.3
[31] git2r_0.23.0 tibble_1.4.2 bayesplot_1.5.0
[34] DT_0.4 shinyjs_1.0 withr_2.1.2
[37] lazyeval_0.2.1 magrittr_1.5 crayon_1.3.4
[40] mime_0.5 evaluate_0.11 R.methodsS3_1.7.1
[43] nlme_3.1-137 xts_0.11-0 colourpicker_1.0
[46] rsconnect_0.8.8 tools_3.5.1 loo_2.0.0
[49] matrixStats_0.54.0 stringr_1.3.1 munsell_0.5.0
[52] compiler_3.5.1 rlang_0.2.1 grid_3.5.1
[55] ggridges_0.5.0 htmlwidgets_1.2 crosstalk_1.0.0
[58] igraph_1.2.1 miniUI_0.1.1.1 base64enc_0.1-3
[61] rmarkdown_1.10 gtable_0.2.0 inline_0.3.15
[64] abind_1.4-5 reshape2_1.4.3 markdown_0.8
[67] R6_2.2.2 gridExtra_2.3 rstantools_1.5.0
[70] zoo_1.8-3 knitr_1.20 bridgesampling_0.4-0
[73] shinythemes_1.1.1 bindr_0.1.1 shinystan_2.5.0
[76] workflowr_1.1.1 rprojroot_1.3-2 stringi_1.2.4
[79] parallel_3.5.1 tidyselect_0.2.4 coda_0.19-1
This reproducible R Markdown analysis was created with workflowr 1.1.1