20190412_Check

Last updated: 2019-04-25

Checks: 5 1

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190319)

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Repository version: 680e39e

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/20190327_MakeCovariateFiles.Rmd
    Untracked:  analysis/20190421_RegressOutRNASeqPCs.Rmd
    Untracked:  docs/figure/20190424_Check_eQTLs.Rmd/

Unstaged changes:
    Modified:   analysis/20190412_Check_eQTLs.Rmd
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	4c2b69f	Benjmain Fair	2019-04-19	a messy commit

library(tidyverse)
library(knitr)
library(data.table)

My first pass at eqtl mapping was as follows: Genotypes filtered for MAF>0.5 & GenotypingRate>0.9. ~10.5M variants passed this filter. Gene expression as log(TPM), filtering for genes with >0 TPM in all samples (~17,500 genes passed this filter). Association testing used the following linear mixed model for each cis-variant-gene-pair (cis definied as <1Mb from gene):

\[ Y =Wα+xβ+u+ε \]

where \(Y\) is gene expression as \(log(TPM)\), \(W\) covariates include first three genotype principal components (to account for population structure) as well as 3 RNA-seq PCs, Sex, and an intercept. \(x\) is coded as 0,1,2, \(U \sim MVN(0,\sigma^2 K)\) where \(K\) a the kinship matrix calculated by KING-robust and negative cells are thresholded at 0 (This results in a symetric positive definite matrix). This wqs implemented in the R package ‘MatrixEQTL’. This resulted in ~7000 eSNP-gene pairs, across ~350eGenes at FDR<0.1 (Benjamini Hodgeberg correction).

Here I want to check that the results of this analysis are reasonable, starting by checking boxplots of gene expression stratified by genotype for a handful of significant eSNP-gene pairs.

# Read in genotypes for eQTLs
Genotypes <- read.table("../data/20190410_eqtls.sig_genotypes.raw", header=T, check.names = F)
# Genotypes[!duplicated(as.list(Genotypes))]
kable(Genotypes[1:10,1:10])

FID	IID	PHENOTYPE	1:8120561:TATGAAGATCA:T_T	1:12231268:G:A_A	1:12260345:TCC:TG_TG	1:12289164:CATCAACTG:GGGCCAC_GGGCCAC
295	Pan_troglodytes_ThisStudy	-0.502512	0	0	1	1
317	Pan_troglodytes_ThisStudy	-0.933389	0	1	1	1
338	Pan_troglodytes_ThisStudy	-1.082890	0	0	1	1
389	Pan_troglodytes_ThisStudy	-0.779304	0	0	1	1
438	Pan_troglodytes_ThisStudy	-0.871399	0	0	1	1
456	Pan_troglodytes_ThisStudy	-0.515976	0	0	1	1
462	Pan_troglodytes_ThisStudy	-0.674267	0	0	1	1
476	Pan_troglodytes_ThisStudy	-1.087490	1	0	1	1
495	Pan_troglodytes_ThisStudy	0.220741	0	0	1	1
4x0025	Pan_troglodytes_ThisStudy	-0.796774	0	0	1	1

#Make sure there aren't duplicate columns
length(colnames(Genotypes))

[1] 6619

length(unique(colnames(Genotypes)))

[1] 6619

# Read in eQTLs from MatrixEQTL output (already filtered for FDR<0.1)
eQTLs <- read.table("../data/20190410_eqtls.txt", header=T)
kable(head(eQTLs))

SNP	gene	beta	t.stat
1:223660942:GTGTTTATTTTAATATTGAG:GTGT	ENSPTRT00000096266.1	19.030028	35.83869
11:32876022:CTCCATCTC:CTCCATCTCA	ENSPTRT00000006565.4	-4.833762	-18.96827
11:32876022:CTCCATCTC:CTCCATCTCA	ENSPTRT00000006565.4	-4.833762	-18.96827
17:30190097:CCACGGCCCGCTAAC:CTGCACCTGGCTGAA	ENSPTRT00000084268.1	16.315581	14.89672
17:30191958:GGCAC:AGCGT	ENSPTRT00000084268.1	16.315581	14.89672
15:65870468:C:T	ENSPTRT00000013635.7	8.335836	14.76090

# Read in phenotypes, from count table
CountTable <- read.table(gzfile('../output/CountTable.tpm.txt.gz'), header=T, check.names=FALSE, row.names = 1) %>% 
  t() %>% 
  as.data.frame() %>% 
  rownames_to_column(var = "FID")
kable(CountTable[1:10, 1:10])

FID	ENSPTRT00000098376.1	ENSPTRT00000091526.1	ENSPTRT00000080032.1	ENSPTRT00000096913.1	ENSPTRT00000079752.1	ENSPTRT00000089496.1	ENSPTRT00000077419.1	ENSPTRT00000090315.1
4X0095	0.0377317	0.0000000	0.000000	0.0781316	0.0873579	0.000000	0.0000000	0.000000
4X0212	0.2722200	0.0000000	0.000000	0.0000000	0.2363460	0.000000	0.0000000	0.000000
4X0267	0.2832680	0.0000000	0.000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000
4X0333	0.0768681	0.0000000	0.000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000
4X0339	0.0241551	0.0279624	0.000000	0.0000000	0.0000000	0.000000	0.0676831	0.000000
4X0354	0.4893560	0.0000000	0.000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000
4X0357	0.1495150	0.0000000	0.000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000
4X0550	0.0338373	0.0000000	0.000000	0.0000000	0.0000000	0.000000	0.0000000	0.000000
4x0025	0.1593970	0.0000000	0.000000	0.0000000	0.0000000	0.176351	0.0000000	0.208957
4x0043	0.2715600	0.0000000	0.126819	0.0937206	0.7335140	0.000000	0.1268190	0.889987

MergedData <- left_join(Genotypes, CountTable, by="FID")

Warning: Column `FID` joining factor and character vector, coercing into
character vector

#eqtls, ordered from most significant at top
kable(head(eQTLs))

SNP	gene	beta	t.stat
1:223660942:GTGTTTATTTTAATATTGAG:GTGT	ENSPTRT00000096266.1	19.030028	35.83869
11:32876022:CTCCATCTC:CTCCATCTCA	ENSPTRT00000006565.4	-4.833762	-18.96827
11:32876022:CTCCATCTC:CTCCATCTCA	ENSPTRT00000006565.4	-4.833762	-18.96827
17:30190097:CCACGGCCCGCTAAC:CTGCACCTGGCTGAA	ENSPTRT00000084268.1	16.315581	14.89672
17:30191958:GGCAC:AGCGT	ENSPTRT00000084268.1	16.315581	14.89672
15:65870468:C:T	ENSPTRT00000013635.7	8.335836	14.76090

# 7682 eQTLs at FDR<0.1
dim(eQTLs)

[1] 7682    6

# betas for these
hist(eQTLs$beta)

# Expression box plot, stratified by genotype. For a fe of top of the list snp-gene pairs (most significant)
data.frame(Genotype = MergedData$`1:223660942:GTGTTTATTTTAATATTGAG:GTGT`,
    Phenotype = MergedData$ENSPTRT00000096266.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`17:30190097:CCACGGCCCGCTAAC:CTGCACCTGGCTGAA`,
    Phenotype = MergedData$ENSPTRT00000084268.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`15:65870468:C:T`,
    Phenotype = MergedData$ENSPTRT00000013635.7,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

Interpretation: This is concerning; these top hits seem to be all false positives, probably driven by genotyping errors as evidenced by a huge excess of heterozygotes. I checked the raw DNA-sequencing at some of these loci (not shown) and there doesn’t seem to be anything obviously wrong with the alignment or variant calling (~50% of non-duplicate, reads without any excess of mismatches and no strand bias indicate a mismatch/true snp at these locations). Paralogous genes can cause mismapping and false positive variant calls at positions like this. Probably the best way to filter them out will be to apply a Hardy-weinberg filter. This positions will deviate from HWE with very extreme Pvals, so applying a very stringent HWE filter (that still allows for some deviation from HWE, as expected because of population substructure and related individuals) may get rid of these problematic genotypes.

Let’s check some randomly sampled eqtls to see if all of the eqtls are like this.

set.seed(80)
RandomSampleOfEqtls <- eQTLs %>% sample_n(10) %>% select(SNP, gene, beta)
kable(RandomSampleOfEqtls)

SNP	gene	beta
3:64175024:T:C	ENSPTRT00000097126.1	-6.5789324
8:41999256:GCTGTGTTT:G	ENSPTRT00000076576.1	-0.8351890
3:50210729:C:T	ENSPTRT00000098400.1	-1.7838139
12:65459594:T:G	ENSPTRT00000048028.3	-2.3928804
12:65707861:T:C	ENSPTRT00000048028.3	-1.9414605
7:145137540:C:T	ENSPTRT00000036621.4	-5.5667535
1:131341766:A:C	ENSPTRT00000092453.1	-0.9729193
12:66854411:A:G	ENSPTRT00000048028.3	-1.9414605
12:64907042:G:A	ENSPTRT00000048028.3	-1.9414605
3:157318426:G:A	ENSPTRT00000029000.5	1.0544365

data.frame(Genotype = MergedData$`1:53393622:A:G`,
    Phenotype = MergedData$ENSPTRT00000108879.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`11:85739664:G:A`,
    Phenotype = MergedData$ENSPTRT00000089773.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`2A:70424303:C:G`,
    Phenotype = MergedData$ENSPTRT00000079798.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`16:58460835:C:T`,
    Phenotype = MergedData$ENSPTRT00000087116.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`10:124136336:C:A`,
    Phenotype = MergedData$ENSPTRT00000103391.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`16:58398631:C:A`,
    Phenotype = MergedData$ENSPTRT00000087116.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

# same gene as previous, snp in LD with previous plot
data.frame(Genotype = MergedData$`16:59259665:G:C`,
    Phenotype = MergedData$ENSPTRT00000087116.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`15:32018636:G:A`,
    Phenotype = MergedData$ENSPTRT00000081598.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`12:66710269:G:A`,
    Phenotype = MergedData$ENSPTRT00000048028.3,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

data.frame(Genotype = MergedData$`1:131341766:A:C`,
    Phenotype = MergedData$ENSPTRT00000092453.1,
    FID=MergedData$FID) %>%
ggplot(aes(x=factor(Genotype), y=Phenotype, label=FID)) +
geom_boxplot(outlier.shape = NA) +
geom_text(position=position_jitter(width=0.25), alpha=1, size=2) +
scale_y_continuous(name="TPM", trans="log10")

Interpretation: At least the first of the randomly selected eqtls seems reasonable. The others seem to be false positives, often driven by gene expression measurement for the MD_And sample. Perhaps this sample should be discarded completely, or alternatively some filtering for outlier gene expression measurements should be implemented on a gene-by-gene basis. This is not an extreme outlier for gene expression on the first PC plots, so another explanation is that this sample might be an outlier in the sense that it has very low heterozygosity, resulting in false positive calls at the gene-snp pairs for which this is the sole genotype2 (alt-homozygote) and for which this sample is a gene expression outlier.

Considerations for future: apply a hwe filter, check heterozygosity of samples, consider removing expression outliers on gene-by-gene basis and or normalization methods (eg quantile normalization would also dampen this outlier effect).

In the code block below I will choose a P-value filter for filtering genotypes by hwe as implemented with plink --hwe midp. The data comes from a subset of 1M snps (that passed undergone MAF>0.5 and Geno>0.9 filters) and their HWE pvals calculated with plink. I am looking for a threshold that takes care of these false positive calls that have an excess of heterozygotes

hwe.results <- read.table(gzfile('../data/20190416_hwe_tests.uncorrupted.txt.gz'), header=T) %>%
  separate(GENO, c("HomA1", "Het", "HomA2"), sep="/", convert=T)
kable(head(hwe.results))

SNP	HomA1	Het	HomA2	P
1:81:A:AGT	0	32	7	0.0000046
1:116:T:C	0	4	35	0.5387000
1:2849:TCAT:GCAC	1	9	29	0.3495000
1:2969:G:A	0	18	21	0.1249000
1:2991:C:T	0	20	19	0.0245700
1:3693:G:C	1	4	34	0.0973800

hist(hwe.results$P)

#qqplot of Pvals
qplot(-log10(1:length(hwe.results$P)/length(hwe.results$P)), -log10(sort(hwe.results$P, decreasing = F))) +
  geom_abline(intercept = 0, slope = 1)

# plot -log10(P) vs num heterozygotes. Plot as hex with log(counts) to see density of points
ggplot(hwe.results, aes(x=Het, y=-log10(P))) +
  stat_binhex(aes(fill=log10(..count..))) +
  xlab("# heterozygote individuals")

Warning: Computation failed in `stat_binhex()`:
Package `hexbin` required for `stat_binhex`.
Please install and try again.

Interpretation: the vast majority of SNPs would pass a HWE filter thresholded at -log10(P)<7.5. There are also a lot of SNPs that would fail this filter because of excess of heterozygotes. Not yet sure if those are genotyping errors or true snps that just don’t obey HWE (as might be expected with the relatedness and population structure in this cohort). Will stick with -log10(P)<7.5 as the threshold. This would filter out those eqtls I suggested were false positives at the beginning of this analysis (-log10(P) is about 10 for those snps).

Below I will check heterozygosity by samples. Used the following plink command: plink --het --maf 0.05 --bfile {MyFile} --geno --hwe 1e-7.5

The F statistic plink uses is: ( - ) / ( - ))

heterozygosity <- read.table('../data/20190416_sample.heterozygosity.txt', header=T)
kable(head(heterozygosity))

FID	IID	O.HOM.	E.HOM.	N.NM.	F
295	Pan_troglodytes_ThisStudy	7947394	7569000	10277099	0.13970
317	Pan_troglodytes_ThisStudy	7663288	7569000	10276542	0.03489
338	Pan_troglodytes_ThisStudy	7921992	7569000	10277191	0.13030
389	Pan_troglodytes_ThisStudy	7991032	7569000	10276645	0.15590
438	Pan_troglodytes_ThisStudy	7855254	7569000	10277192	0.10560
456	Pan_troglodytes_ThisStudy	7970770	7569000	10277040	0.14830

ggplot(heterozygosity, aes(x=FID, y=`F`)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5))

It is the case that MD_And sample has the highest homozygosity, though it’s F-statistic doesn’t stand out. On the basis of this, probably not justified to completely throw the sample out. Will consider phenotype filters on a gene-by-gene basis or quantile normalization to account for outliers that may drive false associations.

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.12.0 knitr_1.22        forcats_0.4.0    
 [4] stringr_1.4.0     dplyr_0.8.0.1     purrr_0.3.2      
 [7] readr_1.3.1       tidyr_0.8.2       tibble_2.1.1     
[10] ggplot2_3.1.0     tidyverse_1.2.1  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       highr_0.8        cellranger_1.1.0 plyr_1.8.4      
 [5] pillar_1.3.1     compiler_3.5.1   git2r_0.24.0     workflowr_1.2.0 
 [9] tools_3.5.1      digest_0.6.18    lubridate_1.7.4  jsonlite_1.6    
[13] evaluate_0.13    nlme_3.1-137     gtable_0.3.0     lattice_0.20-38 
[17] pkgconfig_2.0.2  rlang_0.3.3      cli_1.1.0        rstudioapi_0.10 
[21] yaml_2.2.0       haven_2.1.0      xfun_0.6         withr_2.1.2     
[25] xml2_1.2.0       httr_1.4.0       hms_0.4.2        generics_0.0.2  
[29] fs_1.2.6         rprojroot_1.3-2  grid_3.5.1       tidyselect_0.2.5
[33] glue_1.3.1       R6_2.4.0         readxl_1.1.0     rmarkdown_1.11  
[37] modelr_0.1.4     magrittr_1.5     whisker_0.3-2    backports_1.1.3 
[41] scales_1.0.0     htmltools_0.3.6  rvest_0.3.2      assertthat_0.2.1
[45] colorspace_1.4-1 labeling_0.3     stringi_1.4.3    lazyeval_0.2.2  
[49] munsell_0.5.0    broom_0.5.1      crayon_1.3.4