GWAS for fitness in Drosophila

Estimate the line means for relative fitness

Define functions for prediction
Predict line means and save the results
Summary of model used to predict FEMALE line means:
Summary of model used to predict MALE line means:
Proportion of variance explained by ‘DGRP line’ in these two models:

Session information

Last updated: 2018-09-16

workflowr checks: (Click a bullet for more information)

✖ R Markdown file: uncommitted changes
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.
✔ Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed: set.seed(20180914)

The command set.seed(20180914) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

✔ Repository version: 449a929

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/derived/.DS_Store
    Ignored:    data/derived/output/.DS_Store
    Ignored:    data/input/.DS_Store
    Ignored:    figures/.DS_Store

Untracked files:
    Untracked:  code/Drosophila_GWAS.Rmd
    Untracked:  data/derived/output/male_late_lm,.assoc.txt
    Untracked:  data/derived/output/male_late_lm,.log.txt
    Untracked:  data/derived/trimmed_DGRP.bk

Unstaged changes:
    Modified:   analysis/get_predicted_line_means.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/perform_gwas.Rmd
    Modified:   analysis/plot_line_means.Rmd
    Modified:   data/derived/output/DGRP_GRM.log.txt
    Modified:   data/derived/output/female_early_bslmm.bv.txt
    Modified:   data/derived/output/female_early_bslmm.gamma.txt
    Modified:   data/derived/output/female_early_bslmm.hyp.txt
    Modified:   data/derived/output/female_early_bslmm.log.txt
    Modified:   data/derived/output/female_early_bslmm.param.txt
    Modified:   data/derived/output/female_early_female_late.log.txt
    Modified:   data/derived/output/female_early_lm.log.txt
    Modified:   data/derived/output/female_early_lmm.log.txt
    Modified:   data/derived/output/female_early_male_early.log.txt
    Modified:   data/derived/output/female_late_lm.log.txt
    Modified:   data/derived/output/female_late_lmm.log.txt
    Modified:   data/derived/output/female_late_male_late.log.txt
    Modified:   data/derived/output/male_early_lm.log.txt
    Modified:   data/derived/output/male_early_lmm.log.txt
    Modified:   data/derived/output/male_early_male_late.log.txt
    Modified:   data/derived/output/male_late_lmm.log.txt
    Modified:   figures/figure1.eps
    Modified:   figures/figure2.eps

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

Expand here to see past versions:

File	Version	Author	Date	Message
html	449a929	Luke Holman	2018-09-14	Build site.
html	e981b22	Luke Holman	2018-09-14	Made GWAS page
Rmd	64799c1	Luke Holman	2018-09-14	Made get_predicted_line_means
html	64799c1	Luke Holman	2018-09-14	Made get_predicted_line_means

library(dplyr)
library(brms)
library(pander)

Estimate the line means for relative fitness

Most studies of the DGRP calculate so-called “line means” by averaging the phenotypes of several individuals per line. Instead, we used Bayesian generalised linear mixed model (GLMMs) to estimate the true line means for each of the 4 fitness traits.

One notable advantage of using a model instead of a simple average, is that this allows one to statistically remove confounding factors such as block effects, which would otherwise add noise to the line means. Note that we intentionally measured line 352 in every block, in order to have good precision to estimate the block effect.

Additionally, we must properly account for our experimental design in order to avoid pseudoreplication. Our design involved repeatedly measuring the same individuals to determine early- and late-life fitness.

For the progeny count data used to measure female fitness, we used a Poisson GLMM with the following formula:

$Y = age + (age|line) + (age|block) + (1|vial.ID)$

That is, we specify a random intercept for each vial, block and line. Additionally, we specify a random slope with age for each line and block, allowing for differential change in fitness with age between lines and blocks. The then use the fitted function in the brms package to derive the estimated mean progeny production for each combination of age and line. The line means were left on the scale of the linear predictor (as opposed to back-transforming them via the link function), in order to increase their normality. Furthermore, they were scaled to have a mean of zero and standard deviation of one.

Similarly, for our measure of male fitness (proportion of offspring sired), we used a binomial model with the following formula:

$Y = age + (age|line) + (age|block) + (1|vial.ID)$

Where $Y$ is a 1 or 0 depending on whether the focal offspring was sired by a male from the focal DGRP line or a rival male, respectively.

Define functions for prediction

load_and_tidy_fitness_data <- function(){
  female_fitness <- read.csv("data/input/female_fitness.csv", stringsAsFactors = FALSE) %>% 
    arrange(line) %>%
    mutate(vial.ID = paste("F", block, replicate, line, sep = "_"),
           line = paste("line_", line, sep = "")) %>% 
    select(-num.F.late1, -num.F.late2, -num.laying.F.early, -replicate) 
  female_fitness$female.fitness.early[is.na(female_fitness$female.fitness.early)] <- 0
  female_fitness$female.fitness.late[is.na(female_fitness$female.fitness.late)] <- 0
  
  male_fitness <- read.csv("data/input/male_fitness.csv", stringsAsFactors = FALSE ) %>% 
    arrange(line) %>%
    mutate(vial.ID = paste("M", block, replicate, line, sep = "_"),
           line = paste("line_", line, sep = "")) %>%
    select(-num.DGRP.males, -replicate)
  
  # Save these tidy files, for data archiving
  write.csv(female_fitness, file = "data/input/female_fitness_CLEANED.csv", row.names = FALSE)
  write.csv(male_fitness, file = "data/input/male_fitness_CLEANED.csv", row.names = FALSE)
  
  list(female_fitness = female_fitness, male_fitness = male_fitness)
}


# This function calculates the means for each line
get.line.means <- function(){
  left_join(
    load_and_tidy_fitness_data()[[1]] %>% select(line, female.fitness.early, female.fitness.late) %>% 
      group_by(line) %>%
      summarise(female.fitness.early = mean(female.fitness.early), 
                female.fitness.late = mean(female.fitness.late)),
    load_and_tidy_fitness_data()[[2]] %>% 
      group_by(line) %>%
      summarise(early.male.rival = sum(early.male.rival),
                early.male.focal = sum(early.male.focal),
                late.male.rival = sum(late.male.rival),
                late.male.focal = sum(late.male.focal),
                male.fitness.early = early.male.focal / (early.male.focal + early.male.rival),
                male.fitness.late = late.male.focal / (late.male.focal + late.male.rival)) %>%
      select(line, male.fitness.early, male.fitness.late),
    by = "line") %>% as.data.frame()
}

# A function to calculate each line's expected fitness, using brms models to adjust for block + vial random effects
# For the male late life assay, we also adjust for the number of competitor males that died 
# (this assumes that their deaths happened randomly with respect to which line we were measuring)
get_predicted_line_means <- function(){
  
  # If it's already calculated, just load up the file. Otherwise, make the file
  out.file <- "data/derived/predicted_line_means.csv"
  if(file.exists(out.file)) return(read.csv(out.file, stringsAsFactors = FALSE))
  
  ####### Load the individual-level data
  # Load the female data and rearrange to long format
  female_fitness <- load_and_tidy_fitness_data()[[1]] %>% 
    gather(age, fitness, female.fitness.early, female.fitness.late)
  
  # Load the male data and make 5 imputed datasets using the package 'mice'
  # (there are 2 vials with no early-male fitness, and 1 vial with no count of the number of GFP males)
  # See this link for explanation: https://cran.r-project.org/web/packages/brms/vignettes/brms_missings.html
  male_fitness <- load_and_tidy_fitness_data()[[2]]
  male_fitness_imputed <- mice(male_fitness, m = 5, print = FALSE, seed = 1)
  male_fitness_imputed <- lapply(1:5, function(i) complete(male_fitness_imputed, i))
  male_fitness_imputed <- lapply(male_fitness_imputed, function(focal) {
    focal %>% mutate(total1 = early.male.rival + early.male.focal,
                     total2 = late.male.rival + late.male.focal)})
  
  ####### Run model on female data 
  female_model <- brm(fitness ~ age + (age|line) + (age|block) + (1|vial.ID), 
                      family = poisson, data = female_fitness,
                      iter = 4000, chains = 4, cores = 4,
                      seed = 1, control = list(adapt_delta = 0.999, max_treedepth = 15)) %>% 
    add_ic(ic = "R2")
  writeLines(capture.output(summary(female_model)), con = "data/derived/model_summary_females.txt")
  saveRDS(female_model, "data/derived/female_fitness_prediction_model.rds") # maybe delete later - probably not needed?
  
  
  ### Run model on male data 
  # Fit a non-linear smooth for num.GFP.males, but only for the late life data
  # Inspection of the raw data and model results illustrates the non-linear smooth improves model fit appreciably
  # For example, the 'wiggliness' of the smooth term is significantly non-zero
  # See: https://www.fromthebottomoftheheap.net/2018/04/21/fitting-gams-with-brms/
  male1 <- bf(early.male.focal | trials(total1) ~ (1|line) + (1|block) + (1|vial.ID))
  male2 <- bf(late.male.focal  | trials(total2) ~ s(num.GFP.males) + (1|line) + (1|block) + (1|vial.ID))
  
  # Run 5 models, on each of the 5 imputed datasets, and combine the results
  male_model <- brm_multiple(male1 + male2,
                                 family = binomial, data = male_fitness_imputed,
                                 iter = 4000, chains = 4, cores = 4,
                                 seed = 1, control = list(adapt_delta = 0.999, max_treedepth = 15)) %>%
    add_ic(ic = "R2")
  writeLines(capture.output(summary(male_model)), con = "data/derived/model_summary_males.txt")
  saveRDS(male_model, "data/derived/male_fitness_prediction_model.rds") # maybe delete later - probably not needed?
  male_model <- readRDS("data/derived/male_fitness_prediction_model.rds") 
  
  ####### Define new data for prediction
  new.data.female <- female_fitness %>% 
    select(age, line) %>% distinct() %>% arrange(line, age)
  
  new.data.male <- male_fitness %>% 
    select(line) %>% distinct() %>% arrange(line) %>% 
    mutate(total1 = 100, # I checked, and changing this number does not affect the line means, as I expected
           total2 = 100, 
           num.GFP.males = male_fitness %>% .$num.GFP.males %>% mean(na.rm=TRUE))
  
  # Get predicted lines means for early and late life female fitness on linear predictor scale (i.e. normal)
  female_line_preds <- data.frame(new.data.female, 
                                  fitted(female_model, 
                                         newdata = new.data.female, 
                                         re_formula = "~ (age|line)", 
                                         scale = "linear")) %>%
    select(age, line, Estimate) %>% 
    spread(age, Estimate) %>%
    mutate(female.fitness.early = as.numeric(scale(female.fitness.early)),
           female.fitness.late = as.numeric(scale(female.fitness.late)))
  
  # Get predicted lines means for early and late life male fitness on linear predictor scale (i.e. normal)
  male_line_preds <- data.frame(new.data.male, 
                              fitted(male_model_NEW, 
                                     newdata = new.data.male, 
                                     re_formula = "~ (1|line)", 
                                     scale = "linear")) %>%
  select(line, Estimate.earlymalefocal, Estimate.latemalefocal) %>% 
  mutate(male.fitness.early = as.numeric(scale(Estimate.earlymalefocal)),
         male.fitness.late = as.numeric(scale(Estimate.latemalefocal))) %>%
  select(line, male.fitness.early, male.fitness.late)

  # Calculate the proportion of variance explained by line for females
  
  ## Get the posterior predicted values for each data point
  predicted <- data.frame(female_fitness[complete.cases(female_fitness), ],
                          t(fitted(female_model, summary = FALSE)))
  
  ## Define a function to get the ICC for line, for each posterior sample of predictions
  get_ICC_line <- function(predictions, age_class = NULL){
    if(!is.null(age_class)) predictions <- predictions %>% filter(age == age_class) 
    first.col <- which(str_detect(names(predictions), "X"))[1] 
    sapply(0:7999, function(i){
       ICCbare(predictions$line, predictions[, i+first.col])
    })
  }
  
  # Calculate ICC for each sample, separately for old and young females
  Line_ICC_female_early <- get_ICC_line(predicted, "female.fitness.early")
  Line_ICC_female_late <- get_ICC_line(predicted, "female.fitness.late")
  
  # Do the same for males
  male_preds_early <-  data.frame(line = male_fitness$line,
                                  t(fitted(male_model, summary = FALSE)[,,1]))
  male_preds_late <-  data.frame(line = male_fitness$line,
                                 t(fitted(male_model, summary = FALSE)[,,1]))

  Line_ICC_male_early <- get_ICC_line(male_preds_early)
  Line_ICC_male_late <- get_ICC_line(male_preds_late)
  
  
  ICC <- posterior_summary(as.mcmc(data.frame(Line_ICC_female_early, 
                                              Line_ICC_female_late,
                                              Line_ICC_male_early,
                                              Line_ICC_male_late)))
  write.csv(ICC, file = "data/derived/ICC.csv", row.names = TRUE)
  

  # Record which block(s) each line was measured in. Generally one block, except for ref line (all blocks)
  line_blocks <- rbind(female_fitness %>% select(block, line), 
                       male_fitness   %>% select(block, line)) %>% 
    mutate(block = replace(block, line == "352", "reference_line")) %>% 
    distinct() %>% 
    group_by(line) %>% summarise(block = paste0(block, collapse = ", ")) 
  
  predicted_line_means <- female_line_preds %>%
    left_join(male_line_preds, by = "line") %>% 
    left_join(line_blocks, by = "line") %>%
    arrange(line) %>%
    mutate(line = paste("line_", line, sep = "")) 
  
  write.csv(predicted_line_means, out.file, row.names = FALSE)
  predicted_line_means %>% mutate(line = gsub("line_line_", "line_", line))
}

Predict line means and save the results

The object line_means is the raw line means: unscaled, arithmetic means and in the original units.

The object predicted_line_means is the estimated line means from the Bayesian models. These means are on the scale of the linear predictor (i.e. Poisson for females, binomial for males) rather than on the scale of the original response. They have been scaled to have a mean of 0 and a variance of 1.

line_means <- get.line.means()
predicted_line_means <- get_predicted_line_means()

Summary of model used to predict FEMALE line means:

cat(readLines('data/derived/model_summary_females.txt'), sep = '\n')

 Family: poisson 
  Links: mu = log 
Formula: fitness ~ age + (age | line) + (age | block) + (1 | vial.ID) 
   Data: female_fitness (Number of observations: 1500) 
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1; 
         total post-warmup samples = 8000
    ICs: LOO = NA; WAIC = NA; R2 = 0.94
 
Group-Level Effects: 
~block (Number of levels: 9) 
                                      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)                             0.32      0.12     0.16     0.62       5698 1.00
sd(agefemale.fitness.late)                1.29      0.43     0.74     2.35       4667 1.00
cor(Intercept,agefemale.fitness.late)    -0.41      0.32    -0.88     0.33       8000 1.00

~line (Number of levels: 115) 
                                      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)                             0.54      0.04     0.46     0.63       4294 1.00
sd(agefemale.fitness.late)                1.58      0.13     1.34     1.87       4293 1.00
cor(Intercept,agefemale.fitness.late)     0.41      0.10     0.21     0.58       3559 1.00

~vial.ID (Number of levels: 750) 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(Intercept)     0.48      0.02     0.45     0.51       3005 1.00

Population-Level Effects: 
                       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept                  4.40      0.13     4.15     4.65       8000 1.00
agefemale.fitness.late    -2.96      0.49    -3.96    -2.00       8000 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

Summary of model used to predict MALE line means:

cat(readLines('data/derived/model_summary_males.txt'), sep = '\n')

 Family: MV(binomial, binomial) 
  Links: mu = logit
         mu = logit 
Formula: early.male.focal | trials(total1) ~ (1 | line) + (1 | block) + (1 | vial.ID) 
         late.male.focal | trials(total2) ~ s(num.GFP.males) + (1 | line) + (1 | block) + (1 | vial.ID) 
   Data: male_fitness_imputed (Number of observations: 750) 
Samples: 20 chains, each with iter = 4000; warmup = 2000; thin = 5; 
         total post-warmup samples = 8000
    ICs: LOO = NA; WAIC = NA; R2 = 0.98
 
Smooth Terms: 
                                    Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sds(latemalefocal_snum.GFP.males_1)     2.80      2.11     0.27     8.30       7683 1.00

Group-Level Effects: 
~block (Number of levels: 9) 
                             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(earlymalefocal_Intercept)     0.73      0.24     0.41     1.34       7026 1.00
sd(latemalefocal_Intercept)      1.00      0.38     0.48     1.94       7293 1.00

~line (Number of levels: 115) 
                             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(earlymalefocal_Intercept)     0.57      0.06     0.47     0.68       7060 1.00
sd(latemalefocal_Intercept)      1.37      0.15     1.09     1.67       7202 1.00

~vial.ID (Number of levels: 750) 
                             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(earlymalefocal_Intercept)     0.81      0.03     0.76     0.87       5630 1.01
sd(latemalefocal_Intercept)      1.91      0.08     1.76     2.08       6832 1.00

Population-Level Effects: 
                               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
earlymalefocal_Intercept           0.30      0.27    -0.25     0.81       7744 1.00
latemalefocal_Intercept           -1.17      0.39    -1.94    -0.37       7624 1.00
latemalefocal_snum.GFP.males_1     0.08      0.60    -1.08     1.43       7941 1.00

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).

Proportion of variance explained by ‘DGRP line’ in these two models:

read.csv('data/derived/ICC.csv') %>% 
  rename(Parameter = X, Lower.95.CI = X2.5.ile, Upper.95.CI = X97.5.ile) %>% pander()

Parameter	Estimate	Est.Error	Lower.95.CI	Upper.95.CI
Line_ICC_female_early	0.6044	0.007116	0.5903	0.6183
Line_ICC_female_late	0.8514	0.006105	0.8392	0.863
Line_ICC_male_early	0.5089	0.004727	0.4996	0.5182
Line_ICC_male_late	0.5089	0.004727	0.4996	0.5182

Session information

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2 pander_0.6.2   brms_2.4.0     ggplot2_3.0.0 
[5] Rcpp_0.12.18   dplyr_0.7.6   

loaded via a namespace (and not attached):
 [1] Brobdingnag_1.2-5    R.utils_2.7.0        gtools_3.8.1        
 [4] StanHeaders_2.17.2   threejs_0.3.1        shiny_1.1.0         
 [7] assertthat_0.2.0     stats4_3.5.1         yaml_2.2.0          
[10] pillar_1.3.0         backports_1.1.2      lattice_0.20-35     
[13] glue_1.3.0           digest_0.6.15        promises_1.0.1      
[16] colorspace_1.3-2     htmltools_0.3.6      httpuv_1.4.5        
[19] Matrix_1.2-14        R.oo_1.22.0          plyr_1.8.4          
[22] dygraphs_1.1.1.6     pkgconfig_2.0.1      rstan_2.17.3        
[25] purrr_0.2.5          xtable_1.8-2         mvtnorm_1.0-8       
[28] scales_1.0.0         whisker_0.3-2        later_0.7.3         
[31] git2r_0.23.0         tibble_1.4.2         bayesplot_1.5.0     
[34] DT_0.4               shinyjs_1.0          withr_2.1.2         
[37] lazyeval_0.2.1       magrittr_1.5         crayon_1.3.4        
[40] mime_0.5             evaluate_0.11        R.methodsS3_1.7.1   
[43] nlme_3.1-137         xts_0.11-0           colourpicker_1.0    
[46] rsconnect_0.8.8      tools_3.5.1          loo_2.0.0           
[49] matrixStats_0.54.0   stringr_1.3.1        munsell_0.5.0       
[52] compiler_3.5.1       rlang_0.2.1          grid_3.5.1          
[55] ggridges_0.5.0       htmlwidgets_1.2      crosstalk_1.0.0     
[58] igraph_1.2.1         miniUI_0.1.1.1       base64enc_0.1-3     
[61] rmarkdown_1.10       gtable_0.2.0         inline_0.3.15       
[64] abind_1.4-5          reshape2_1.4.3       markdown_0.8        
[67] R6_2.2.2             gridExtra_2.3        rstantools_1.5.0    
[70] zoo_1.8-3            knitr_1.20           bridgesampling_0.4-0
[73] shinythemes_1.1.1    bindr_0.1.1          shinystan_2.5.0     
[76] workflowr_1.1.1      rprojroot_1.3-2      stringi_1.2.4       
[79] parallel_3.5.1       tidyselect_0.2.4     coda_0.19-1

This reproducible R Markdown analysis was created with workflowr 1.1.1