20191001_GtexAnnotations

Last updated: 2019-10-03

Checks: 6 1

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190319)

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 65c8360

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/20190521_eQTL_CrossSpeciesEnrichment_cache/
    Ignored:    analysis_temp/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    docs/.DS_Store
    Ignored:    docs/assets/.DS_Store

Untracked files:
    Untracked:  analysis/20191001_GtexAnnotations.Rmd
    Untracked:  analysis_temp/20190716_VarianceInsteadOfEgenes_2.Rmd
    Untracked:  docs/figure/20190930_OverdispersionEstimates.Rmd/

Unstaged changes:
    Deleted:    analysis/20190716_VarianceInsteadOfEgenes_2.Rmd
    Modified:   analysis/20190930_OverdispersionEstimates.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

From my dataset of gene expression measurements across 39 chimp individuals (RNA-seq from heart tissue) and ~39 human individuals (~50 RNA-seq sampels randomly chosen from human GTEx datasets, then culled to exclude outliers and lowly sequenced samples) I am interested in quantifying gene expression dispersion within each population. I had previously made the observation that gene expression overdispersion is well correlated (\(R^2\approx0.5\)) between the two species, even after regressing out the fact that variance is correlated to expression (due to both biological and statstical reasons). This amount of correlation is higher than I would have guessed, given how small the overlap is between eGenes that I could detect. However, I still don’t have a sense of what I should have expected. To gain some intuition on this, I would like to use other GTEx tissues to estimate gene-wise overdispersion parameters, and compare the correlation across tissues. First, let’s peruse the GTEx samples available to us, to pick ~40 reasonable samples per tissue for this analysis.

library(tidyverse)

Samples <- read.table("../data/GTExAnnotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt", header=T, sep='\t', quote="")
Subjects <- read.table("../data/GTExAnnotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt", header=T, sep='\t')
GtexColors <- read.table("../data/GTEx_Analysis_TissueColorCodes.txt", sep='\t', header=T, stringsAsFactors = F) %>% mutate(HEX=paste0("#",Color.code))

TissueColorVector <- GtexColors$HEX
names(TissueColorVector)<-GtexColors$Tissue

#Num samples per tissue. Note that I may want to filter out some samples to draw from ones where experimental protocols (like RNA extraction method) were kept reasonably constant
Samples %>%
  filter(SMAFRZE=="RNASEQ") %>%
  count(SMTSD,SMNABTCHT) %>%
  filter(SMTSD %in% GtexColors$Tissue) %>% mutate(sum=sum(n)) %>%
  ggplot(aes(x=SMTSD, y=n, fill=SMTSD, color=SMNABTCHT)) +
    geom_col(show.legend = T) +
    ylab("Number Samples") +
    theme_bw() +
    scale_fill_manual(values=TissueColorVector, guide=FALSE) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
    theme(legend.position="top") +
    guides(color=guide_legend(nrow=2,byrow=TRUE))

#Histogram of mapping rate
hist(Samples$SMMAPRT)

#Histogram of rRNA mapping rate
hist(Samples$SMRRNART)

#Histogram of exonic mapping rate
hist(Samples$SMEXNCRT)

#Histogram of mapped library size
hist(log10(Samples$SMMPPD))

#RIN score
hist(Samples$SMRIN)

#types of library preps
table(Samples$SMGEBTCHT)


                Affymetrix Expression 
                                  837 
                             ChIP-Seq 
                                  356 
                            DroNc-seq 
                                   20 
   Exome Express (Agilent SureSelect) 
                                  100 
       Illumina Human Exome SNP Array 
                                  189 
              Illumina OMNI SNP Array 
                                  462 
PCR-Free 30x Coverage WGS v1 (HiSeqX) 
                                  249 
    PCR+ 30x Coverage WGS (HiSeq2000) 
                                   79 
 PCR+ 30x Coverage WGS v1 (HiSeq2000) 
                                    1 
    PCR+ 30x Coverage WGS v1 (HiSeqX) 
                                    1 
    PCR+ 30x Coverage WGS v2 (HiSeqX) 
                                  579 
                              RIP-Seq 
                                  226 
   Standard Exome Sequencing v2 (ICE) 
                                   80 
   Standard Exome Sequencing v3 (ICE) 
                                  510 
   Standard Exome Sequencing v5 (ICE) 
                                  315 
                             Tech Dev 
                                   24 
      Tech Dev - Large Insert RNA-seq 
                                   65 
                            TruSeq.v1 
                                18858

#types of nucleic acid preps
table(Samples$SMNABTCHT)


                                                       
                                                   376 
DNA Extraction from Paxgene-derived Lysate Plate Based 
                                                   117 
      DNA Isolation from Tissue via QIAgen Spin Column 
                                                    87 
      DNA Isolation of Compromised Blood with Autopure 
                                                    30 
  DNA isolation_Whole Blood_QIAGEN Puregene (Autopure) 
                                                    40 
    DNA isolation_Whole Blood_QIAGEN Puregene (Manual) 
                                                  2285 
RNA Extraction from Paxgene-derived Lysate Plate Based 
                                                 11760 
              RNA isolation_PAXgene Blood RNA (Manual) 
                                                   933 
                    RNA isolation_PAXgene Tissue miRNA 
                                                  6387 
             RNA isolation_Trizol Manual (Cell Pellet) 
                                                   936

#sample type. RNASEQ type was deemed suitable for GTEx v8 eQTL calling
table(Samples$SMAFRZE)


        EXCLUDE    OMNI  RNASEQ     WES     WGS 
    602    2700     450   17382     979     838

#Now let's randomly pick ~40 samples that are  TruSeq.v1 and have coverage comparable to my ChimpSamples (~70M mapped reads), and are similar in other metrics (RNA Extraction from Paxgene-derived Lysate Plate Based).



Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Apply filters, like keeping RNA-extraction method consistent within tissue, minimum mapping rate, etc.
SamplesFilteredForTissuesSuitableForAnalysis <- Samples %>%
  mutate(IndidualID=gsub("GTEX-(\\w+)-.+", "\\1", SAMPID, perl=T)) %>%
  # filter(SMNABTCHT %in% c("RNA Extraction from Paxgene-derived Lysate Plate Based")) %>%
  filter(SMAFRZE=="RNASEQ") %>%
  group_by(SMTSD) %>%
  mutate(c=Mode(SMNABTCHT)) %>%
  filter(quantile(SMMPPD, 0.1)<SMMPPD &
        quantile(SMMAPRT, 0.1)<SMMAPRT &
        quantile(SMRIN, 0.1)<SMRIN) %>%
  ungroup() %>%
  filter(SMNABTCHT==c) %>%
  add_count(SMTSD, name="CountBeforeSampling") %>%
  filter(SMTSD %in% GtexColors$Tissue) %>%
  filter(CountBeforeSampling>40)

# Acceptable samples to draw from.
SamplesFilteredForTissuesSuitableForAnalysis %>%
  group_by(SMTSD) %>%
  count(SMTSD, SMNABTCHT) %>%
  ggplot(aes(x=SMTSD, y=n, fill=SMTSD, color=SMNABTCHT)) +
    geom_col(show.legend = T) +
    ylab("Number Samples") +
    theme_bw() +
    scale_fill_manual(values=TissueColorVector, guide=FALSE) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
    theme(legend.position="top") +
    guides(color=guide_legend(nrow=2,byrow=TRUE))

# Samples drawn at random to keep sample size consistent
set.seed(0)
SuitableSubsample <- SamplesFilteredForTissuesSuitableForAnalysis  %>%
  group_by(SMTSD) %>%
  sample_n(40) %>% ungroup()
SuitableSubsample %>%
  count(SMTSD, SMNABTCHT) %>%
  ggplot(aes(x=SMTSD, y=n, fill=SMTSD, color=SMNABTCHT)) +
    geom_col(show.legend = T) +
    ylab("Number Samples") +
    theme_bw() +
    scale_fill_manual(values=TissueColorVector, guide=FALSE) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
    theme(legend.position="top") +
    guides(color=guide_legend(nrow=2,byrow=TRUE))

#How many tissues exist for this analysis
SuitableSubsample %>% distinct(SMTSD) %>% dim()

[1] 48  1

# How many times do the same individuals show up in multiple tissues for this subsample?
SuitableSubsample %>% select(SAMPID, SMTSD, IndidualID) %>%
  count(IndidualID) %>% pull(n) %>% hist(ylab="#individuals with X tissues in subsample")

#Could be worth also making a subsample of the same 40 people for all tissues.
#Unfortunately that wouldn't be possible. Would need to consider less tissues, or less individuals.
SamplesFilteredForTissuesSuitableForAnalysis %>%
  count(IndidualID) %>% pull(n) %>% hist(ylab="#individuals with X tissues in subsample")

SuitableSubsample %>% select(SAMPID, SMTSD, IndidualID) %>%
  write.table("../data/OverdispersionGTExAnalysisSampleList.txt", sep="\t", quote=F, row.names = F)

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.1     purrr_0.3.2    
[5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.1.1  
[9] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       cellranger_1.1.0 plyr_1.8.4       pillar_1.4.1    
 [5] compiler_3.5.1   git2r_0.25.2     workflowr_1.4.0  tools_3.5.1     
 [9] digest_0.6.19    lubridate_1.7.4  jsonlite_1.6     evaluate_0.14   
[13] nlme_3.1-140     gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.2 
[17] rlang_0.3.4      cli_1.1.0        rstudioapi_0.10  yaml_2.2.0      
[21] haven_2.1.0      xfun_0.7         withr_2.1.2      xml2_1.2.0      
[25] httr_1.4.0       knitr_1.23       hms_0.4.2        generics_0.0.2  
[29] fs_1.3.1         rprojroot_1.3-2  grid_3.5.1       tidyselect_0.2.5
[33] glue_1.3.1       R6_2.4.0         readxl_1.3.1     rmarkdown_1.13  
[37] modelr_0.1.4     magrittr_1.5     backports_1.1.4  scales_1.0.0    
[41] htmltools_0.3.6  rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1
[45] labeling_0.3     stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0   
[49] broom_0.5.2      crayon_1.3.4

20191001_GtexAnnotations

Ben Fair

10/1/2019