Last updated: 2020-09-30

Checks: 7 0

Knit directory: Blancetal/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200217) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 3294a81. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/quaint-results.rda
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  data/schaefer_clusters.csv~
    Untracked:  figures/Supplement_Ve.png
    Untracked:  figures/phist.png

Unstaged changes:
    Modified:   analysis/Environmental_variance.Rmd
    Modified:   analysis/Selection_on_Expression_of_Env_Rsponse_Genes.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Selection_on_expression_of_coexpreession_clusters.Rmd) and HTML (docs/Selection_on_expression_of_coexpreession_clusters.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 3294a81 jgblanc 2020-09-30 new clusters
html 9ede1d4 jgblanc 2020-04-24 Build site.
Rmd 69c5a29 jgblanc 2020-04-24 Finished clusters
html e26e754 jgblanc 2020-04-24 Build site.
Rmd 3a6666a jgblanc 2020-04-24 Finished clusters
Rmd 8298d4d GitHub 2020-04-16 Merge branch ‘master’ into master
Rmd 6b00f47 em 2020-03-27 stuff
Rmd a63fa86 em 2020-03-26 stuff
html 7591f0d jgblanc 2020-03-16 Build site.
Rmd df4fe40 jgblanc 2020-03-16 added more pvals
Rmd 2cb89cd jgblanc 2020-03-13 adding coexpression stuff

Intro

Here is the code to test for expression on coeexpression clusters from Walleey et al.

Code

For this analysis we want to test for selection within specific coexpression modules. We used coexpression modules from Walley et al. (2016) who used weight gene coexpression network analysis (WGCNA) to to group together genes that were similarly expressed in at least 4 tissues in one maize inbred line. Their analysis resulted in 66 co-expression networks. Below we will load their co-expression networks and select all the clusters that have at least 100 genes in them.

modules <- read.delim("../data/Modules.txt",na.strings=c("","NA"), stringsAsFactors = F)
num_genes <- apply(modules, 2, function(x) length(which(!is.na(x))))
num_genes_100 <- which(num_genes >= 100)
modules  <- modules[,num_genes_100]

We now have 51 modules that have at least 100 genes. We will now get the median expression of all the genes that are present in our RNAseq dataset for each module and then test for selection on the median values in each cluster. The code below is similar code used to identify selected genes, we are simply conducting the test on median expression values in each line instead of a single gene expression in each line. Again we test for selection on the first 5 PCs and use the last half of PCs to estimate \(V_a\). Th function will return the p-values from the \(Q_{pc}\) test for all tissues.

cluster_analysis_func <- function(modules) {

alltissues = c('GRoot',"Kern","LMAD26","LMAN26", "L3Tip","GShoot","L3Base")

alltissuemodules = lapply(alltissues, function(mytissue){
  
##get the names of genes we have expression data for in each tissue
exp <- read.table(paste("../data/Raw_expression/",mytissue,".txt", sep="")) #  read in expression level
geneNames = names(exp)

#get the median expression level for each module in this tissue
moduleExpression = apply(modules, 2, function(x){
  olap = x[x %in% geneNames] #get genes in the module tha we have expression data for
  moduleExp = dplyr::select(exp, all_of(olap)) #pull out the expression data for these genes
  medianExp = apply(moduleExp, 1, median)
  return(medianExp)
})

####identify selection in each of these modules
# Read in tissue specific kinship matrix 
myF <- read.table(paste('../data/Kinship_matrices/F_',mytissue,'.txt',sep=""))
## Get Eigen Values and Vectors 
myE <- eigen(myF)
E_vectors <- myE$vectors
E_values <- myE$values
## Testing for selection on first 5 PCs
myM = 1:5
## Using the last 1/2 of PCs to estimate Va
myL = 6:dim(myF)[1]
### test for selection on all modules
moduleSelection <- apply(moduleExpression, 2, function(x){
meanCenteredX = x[-length(x)] - mean(x)
myQpc = calcQpc(myZ = meanCenteredX, myU = E_vectors, myLambdas = E_values, myL = myL, myM = myM)
return(myQpc$pvals)
})
##make a dataframe with info to return
mydf = data.frame(t(moduleSelection))
names(mydf) = c('PC1','PC2','PC3','PC4','PC5')
mydf$module = row.names(mydf)
row.names(mydf) <- NULL
mydflong = tidyr::gather(mydf, 'PC','pval', PC1:PC5)
mydflong$tissue = mytissue
return(mydflong)})

return(alltissuemodules)
}

We have the p-values, let’s calculate the FDR and look for significant p-values.

alltissuemodules = cluster_analysis_func(modules)

#combine all into one list
bigdf <- do.call('rbind', alltissuemodules)

bigdf$fdr <- p.adjust(bigdf$pval, method='fdr') ##calculate an FDR
summary(bigdf$fdr) ##nothing is significant
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4299  0.9996  0.9996  0.9990  0.9996  0.9996 
kable(bigdf[which.min(bigdf[,3]),])
module PC pval tissue fdr
732 RNA_Root_Meristem PC5 0.0002409 LMAD26 0.4299294

There is no significant selection on coexpression clusters. The lowest FDR is for the “RNA Root Meristem” cluster in adult leaf tissue on PC 5. We can also look at the clusters with with lowest bonferroni corrected p-values

t <- bigdf %>% group_by(PC, tissue) %>% mutate(bonferroni = p.adjust(pval, method='bonferroni')) %>% arrange(bonferroni)
kable(head(t))
module PC pval tissue fdr bonferroni
RNA_Root_Meristem PC5 0.0002409 LMAD26 0.4299294 0.0122837
RNA_FemaleSpikIt_._VegMeristem PC1 0.0012924 Kern 0.7985099 0.0659101
Protein_Germinating_Kernals PC1 0.0017164 Kern 0.7985099 0.0875356
RNA_Root_Cortex_2 PC1 0.0017894 Kern 0.7985099 0.0912583
Protein_EndoCrown.PericarpAl PC1 0.0039731 Kern 0.9996316 0.2026281
RNA_Root_Cortex PC1 0.0053132 Kern 0.9996316 0.2709719

Clusters from Schaefer et al.

ZmRoot Clusters - get clustes with >= 100 genes

clusters <- fread("../data/schaefer_clusters.csv")
clusters <- clusters[,1:4]

dat <- clusters %>% group_by(ZmRoot) %>% mutate(num_genes = n()) %>% filter(num_genes >= 100)
names <- unique(dat$ZmRoot)

gene_list <- list()
max <- nrow(dat[dat$ZmRoot == 1,])
for(i in 1:length(names)) {
  df <- clusters %>% filter(ZmRoot == names[i])
  genes <- c(df$V1, rep(NA, max-nrow(df)))
  gene_list[[i]] <- genes
}

modules <- do.call(cbind, gene_list)

Run analysis and FDR correction

alltissuemodules = cluster_analysis_func(modules)

#combine all into one list
bigdf <- do.call('rbind', alltissuemodules)

bigdf$fdr <- p.adjust(bigdf$pval, method='fdr') ##calculate an FDR
summary(bigdf$fdr) ##nothing is significant
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4377  0.9951  0.9951  0.9889  0.9951  0.9951 
kable(bigdf[which.min(bigdf[,3]),])
module PC pval tissue fdr
158 4 PC5 0.0011368 LMAD26 0.4376829

Look at lowest p-value clusters

t <- bigdf %>% group_by(PC, tissue) %>% mutate(bonferroni = p.adjust(pval, method='bonferroni')) %>% arrange(bonferroni)
kable(head(t))
module PC pval tissue fdr bonferroni
4 PC5 0.0011368 LMAD26 0.4376829 0.0125052
3 PC5 0.0023537 LMAD26 0.4530808 0.0258903
9 PC1 0.0053389 Kern 0.5741055 0.0587274
8 PC2 0.0063947 LMAD26 0.5741055 0.0703414
9 PC3 0.0074559 LMAN26 0.5741055 0.0820151
11 PC1 0.0186208 Kern 0.9950821 0.2048286

ZmSAM Clusters - get clustes with >= 100 genes

dat <- clusters %>% group_by(ZmSAM) %>% mutate(num_genes = n()) %>% filter(num_genes >= 100)
names <- unique(dat$ZmSAM)

gene_list <- list()
max <- nrow(dat[dat$ZmSAM == 0,])
for(i in 1:length(names)) {
  df <- clusters %>% filter(ZmSAM == names[i])
  genes <- c(df$V1, rep(NA, max-nrow(df)))
  gene_list[[i]] <- genes
}

modules <- do.call(cbind, gene_list)

Run analysis and FDR correction

alltissuemodules = cluster_analysis_func(modules)

#combine all into one list
bigdf <- do.call('rbind', alltissuemodules)

bigdf$fdr <- p.adjust(bigdf$pval, method='fdr') ##calculate an FDR
summary(bigdf$fdr) ##nothing is significant
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9999  0.9999  0.9999  0.9999  0.9999  0.9999 
kable(bigdf[which.min(bigdf[,3]),])
module PC pval tissue fdr
184 2 PC5 0.0033344 LMAD26 0.9999109

Look at lowest p-value clusters

t <- bigdf %>% group_by(PC, tissue) %>% mutate(bonferroni = p.adjust(pval, method='bonferroni')) %>% arrange(bonferroni)
kable(head(t))
module PC pval tissue fdr bonferroni
2 PC5 0.0033344 LMAD26 0.9999109 0.0433477
7 PC3 0.0315067 LMAN26 0.9999109 0.4095866
10 PC1 0.0320505 Kern 0.9999109 0.4166559
4 PC5 0.0436062 LMAD26 0.9999109 0.5668812
2 PC1 0.0536048 Kern 0.9999109 0.6968619
5 PC5 0.0560459 LMAN26 0.9999109 0.7285967

ZmPAN Clusters - get clustes with >= 100 genes

dat <- clusters %>% group_by(ZmPAN) %>% mutate(num_genes = n()) %>% filter(num_genes >= 100)
names <- unique(dat$ZmPAN)

gene_list <- list()
max <- nrow(dat[dat$ZmPAN == 0,])
for(i in 1:length(names)) {
  df <- clusters %>% filter(ZmPAN == names[i])
  genes <- c(df$V1, rep(NA, max-nrow(df)))
  gene_list[[i]] <- genes
}

modules <- do.call(cbind, gene_list)

Run analysis and FDR correction

alltissuemodules = cluster_analysis_func(modules)

#combine all into one list
bigdf <- do.call('rbind', alltissuemodules)

bigdf$fdr <- p.adjust(bigdf$pval, method='fdr') ##calculate an FDR
summary(bigdf$fdr) ##nothing is significant
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9992  0.9992  0.9992  0.9992  0.9992  0.9992 
kable(bigdf[which.min(bigdf[,3]),])
module PC pval tissue fdr
241 3 PC5 0.0045783 LMAD26 0.9991676

Look at lowest p-value clusters

t <- bigdf %>% group_by(PC, tissue) %>% mutate(bonferroni = p.adjust(pval, method='bonferroni')) %>% arrange(bonferroni)
kable(head(t))
module PC pval tissue fdr bonferroni
3 PC5 0.0045783 LMAD26 0.9991676 0.0778318
2 PC5 0.0090431 GShoot 0.9991676 0.1537334
1 PC5 0.0092036 LMAN26 0.9991676 0.1564620
13 PC1 0.0129806 Kern 0.9991676 0.2206705
13 PC5 0.0142503 LMAD26 0.9991676 0.2422548
10 PC2 0.0157494 LMAN26 0.9991676 0.2677405

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.29        data.table_1.12.8 dplyr_1.0.0       quaint_0.0.0.9000
[5] ggpubr_0.3.0      ggplot2_3.3.2     reshape2_1.4.4    workflowr_1.6.2  

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0 xfun_0.15        purrr_0.3.4      haven_2.3.1     
 [5] lattice_0.20-41  carData_3.0-4    colorspace_1.4-1 vctrs_0.3.1     
 [9] generics_0.0.2   htmltools_0.5.0  yaml_2.2.1       rlang_0.4.6     
[13] later_1.1.0.1    pillar_1.4.4     foreign_0.8-72   glue_1.4.1      
[17] withr_2.2.0      readxl_1.3.1     lifecycle_0.2.0  plyr_1.8.6      
[21] stringr_1.4.0    cellranger_1.1.0 munsell_0.5.0    ggsignif_0.6.0  
[25] gtable_0.3.0     zip_2.0.4        evaluate_0.14    rio_0.5.16      
[29] forcats_0.5.0    httpuv_1.5.4     curl_4.3         highr_0.8       
[33] broom_0.5.6      Rcpp_1.0.4.6     promises_1.1.1   scales_1.1.1    
[37] backports_1.1.8  abind_1.4-5      fs_1.4.1         hms_0.5.3       
[41] digest_0.6.25    stringi_1.4.6    openxlsx_4.1.5   rstatix_0.6.0   
[45] grid_3.6.2       rprojroot_1.3-2  tools_3.6.2      magrittr_1.5    
[49] tibble_3.0.1     crayon_1.3.4     whisker_0.4      tidyr_1.1.0     
[53] car_3.0-8        pkgconfig_2.0.3  ellipsis_0.3.1   rmarkdown_2.3   
[57] R6_2.4.1         nlme_3.1-148     git2r_0.27.1     compiler_3.6.2