Last updated: 2019-08-21

Checks: 7 0

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/20190521_eQTL_CrossSpeciesEnrichment_cache/
    Ignored:    analysis_temp/.DS_Store
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    docs/.DS_Store
    Ignored:    docs/assets/.DS_Store

Untracked files:
    Untracked:  analysis/20190821_eGeneTissueCount.Rmd
    Untracked:  docs/figure/20190821_eGeneTissueCount.Rmd/

Unstaged changes:
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd fdbae12 Benjmain Fair 2019-08-14 updates
Rmd e9747fe Benjmain Fair 2019-07-24 update site
Rmd f469d6a Benjmain Fair 2019-07-16 new analyses

For all the analyses I have using different levels of eGene character as a proxy for stabilizing selection (to get at what genes expression levels are important, and which are allowed to vary), I want to also do the same with coefficient of variation(CV) (as opposed to eGene character which requires that we can explain the variation by a cis-SNP). Even though CV does not require a heritable component to gene expression (which would therefore not be a target of stabilizing selection), it is more easy to measure CV than cis-eQTL mapping, and we know that 80% of gene expression heritability is trans anyway, so prioritizing stabilizing selection targets by presence of cis-eGene quality really only adds minimal evidence for heritability.

library(tidyverse)
library(knitr)
library("edgeR")
library(stats)
library(corrplot)
library(gplots)
library("clusterProfiler")
library("org.Hs.eg.db")
library(enrichplot)




# Helper function reference in body of later function
rep.col<-function(x,n){
   matrix(rep(x,each=n), ncol=n, byrow=TRUE)
}

# Modified from the function in the PowerAnalysis Rmarkdown.
# Function to return RPKM table from chimp and human datasets (n=38 each)
# Use GenesToKeep argument to subset the same list that I compared for eQTLs
GetRPKMCountTable <-function(ChimpCountTableFile, HumanCountTableFile, SubsampleSize, GenesToKeep, ChimpSampleDrop=NULL, HumanSampleDrop=NULL)
  #if SubsampleSize parameter == 0, use full table, otherwise, subsample from it
  {
   FullChimpData <- read.table(gzfile(ChimpCountTableFile), header=T, check.names=FALSE, skip=1)
   FullHumanData <- read.table(gzfile(HumanCountTableFile), header=T, check.names=FALSE, skip=1)
   
  if (!is.null(ChimpSampleDrop)){
     FullChimpData <- FullChimpData %>% dplyr::select(-ChimpSampleDrop)
   }   
  if (!is.null(HumanSampleDrop)){
     FullHumanData <- FullHumanData %>% dplyr::select(-HumanSampleDrop)
   }

   if (SubsampleSize==0){
     CountTableChimp <- FullChimpData
     colnames(CountTableChimp) <- paste0("C.", colnames(CountTableChimp))
     CountTableHuman <- FullHumanData
     colnames(CountTableHuman) <- paste0("H.", colnames(CountTableHuman))
     
   } else {
     CountTableChimp <- FullChimpData %>% dplyr::select(c(1:6, sample(7:length(FullChimpData), SubsampleSize)))
     colnames(CountTableChimp) <- paste0("C.", colnames(CountTableChimp))
     
     CountTableHuman <- FullHumanData %>% dplyr::select(c(1:6, sample(7:length(FullHumanData), SubsampleSize)))
     colnames(CountTableHuman) <- paste0("H.", colnames(CountTableHuman))
   }
   
CombinedTable <- inner_join(CountTableChimp[,c(1,7:length(CountTableChimp))], CountTableHuman[,c(1,7:length(CountTableHuman))], by=c("C.Geneid"="H.Geneid")) %>%
  column_to_rownames("C.Geneid") %>% as.matrix()

SpeciesFactor <- colnames(CombinedTable) %>% substr(1,1) %>% factor() %>% unclass() %>% as.character()

d0 <- DGEList(CombinedTable)
d0 <- calcNormFactors(d0)



d <- d0[GenesToKeep,]
mm <- model.matrix(~0 + SpeciesFactor)
y <- voom(d, mm, normalize.method="cyclicloess", plot=F)

GeneLengths <- inner_join(CountTableChimp[,c("C.Geneid", "C.Length")], CountTableHuman[,c("H.Geneid", "H.Length")], by=c("C.Geneid"="H.Geneid"))
GeneLengthMatrix <- cbind(
  rep.col(log2(GeneLengths$C.Length/1000), length(CountTableChimp)-6),
  rep.col(log2(GeneLengths$H.Length/1000), length(CountTableHuman)-6))
rownames(GeneLengthMatrix) <- GeneLengths$C.Geneid
y$E <- y$E - GeneLengthMatrix[rownames(y$E),]

return(y)
# return(d0)

}

Get list of cis-eQTL tested genes that were tested in both chimps and humans and one-to-one orthologs and also in DE gene count tables (from reads mapped to ortho exons):

CountTableChimpFile <- '../output/PowerAnalysisFullCountTable.Chimp.subread.txt.gz'
CountTableHumanFile <- '../output/PowerAnalysisFullCountTable.Human.subread.txt.gz'

eQTLs <- read.table(gzfile("../data/PastAnalysesDataToKeep/20190521_eQTLs_250kB_10MAF.txt.gz"), header=T)

# List of chimp tested genes
ChimpTestedGenes <- rownames(read.table('../output/ExpressionMatrix.un-normalized.txt.gz', header=T, check.names=FALSE, row.names = 1))

ChimpToHumanGeneMap <- read.table("../data/Biomart_export.Hsap.Ptro.orthologs.txt.gz", header=T, sep='\t', stringsAsFactors = F)

# Of this ortholog list, how many genes are one2one
OneToOneMap <- ChimpToHumanGeneMap %>%
  filter(Chimpanzee.homology.type=="ortholog_one2one")

# Read gtex heart egene list
# Only consider those that were tested in both species and are one2one orthologs
GtexHeartEgenes <- read.table("../data/Heart_Left_Ventricle.v7.egenes.txt.gz", header=T, sep='\t', stringsAsFactors = F) %>%
  mutate(gene_id_stable = gsub(".\\d+$","",gene_id)) %>%
  filter(gene_id_stable %in% OneToOneMap$Gene.stable.ID) %>%
  mutate(chimp_id = plyr::mapvalues(gene_id_stable, OneToOneMap$Gene.stable.ID,  OneToOneMap$Chimpanzee.gene.stable.ID, warn_missing = F)) %>%
  filter(chimp_id %in% ChimpTestedGenes)


EgenesTested <- gsub("\\..+", "", GtexHeartEgenes$gene_id, perl=T)
length(EgenesTested)
[1] 11586
GenesInDESet <- read.table(gzfile(CountTableChimpFile), header=T, check.names=FALSE, skip=1)$Geneid
length(GenesInDESet)
[1] 44125
GeneList <- intersect(as.character(GenesInDESet),EgenesTested)
kable(head(GeneList))
x
ENSG00000186827
ENSG00000078808
ENSG00000176022
ENSG00000184163
ENSG00000160087
ENSG00000131584
length(GeneList)
[1] 11416

Ok, now get CountTable of 38 chimps, 38 humans (based on remapped data that maps to orthologous exons) of RPKM, and filtered for the outlier samples that I also left out of the DE gene power analysis

HumanSamplesToDrop <- c(c("SRR1507229","SRR603918", "SRR1478149", "SRR598509", "SRR613186"), c("SRR1489693", "SRR598148", "59167", "SRR1478900", "SRR1474730", "61317"))
ChimpSamplesToDrop <- c("Little_R")




CountTable <- GetRPKMCountTable(CountTableChimpFile,
              CountTableHumanFile,
              0, GeneList, ChimpSampleDrop=ChimpSamplesToDrop, HumanSampleDrop = HumanSamplesToDrop)

qplot(apply(CountTable,1,mean), sqrt(apply(CountTable,1,var)), alpha=0.05) +
  geom_smooth(method="loess", show_guide = FALSE, se=F) +
  xlab("Mean expression (log(RPKM)") +
  ylab("Standard deviation") +
  theme_bw() +
  theme(legend.position = "none")

scatter.smooth(apply(CountTable,1,mean), sqrt(apply(CountTable,1,var)), color="red")

CountTableNonLog <- 2**CountTable$E


qplot(apply(CountTableNonLog,1,mean), sqrt(apply(CountTableNonLog,1,var)/apply(CountTableNonLog,1,mean)), alpha=0.01)

ToPlot <- data.frame(CV=sqrt(apply(CountTableNonLog,1,var)/apply(CountTableNonLog,1,mean)),
                     mean = apply(CountTableNonLog,1,mean), 
                     SDlogexpression = sqrt(apply(CountTable,1,var)),
                     logmean = apply(CountTable,1,mean))

# standard deviation vs expression (after log transforming RPKM), with CV as colors
ggplot(ToPlot, aes(x=log(mean), y=SDlogexpression, color=log(CV))) +
  geom_point()

Coefficient of variation still doesn’t really capture the character I am going after (since all the high CV genes are highly expressed), which is how much does a gene vary relative to other genes at its expression level. For now (for simplicity) I may simply use the standard deviation (after log transformation as my metric to get at stabilizing selection)

Below I will try fitting a loess mean-variance trend and rank genes according the how elevated they are above that trend

Gene.summarystats <- data.frame(CV=sqrt(apply(CountTableNonLog,1,var)/apply(CountTableNonLog,1,mean)),
                     mean = apply(CountTableNonLog,1,mean), 
                     SDlogexpression = sqrt(apply(CountTable,1,var)),
                     logmean = apply(CountTable,1,mean))

Gene.summarystats.lo <- loess(SDlogexpression ~ logmean, Gene.summarystats)
Gene.summarystats$SD.minus.loess <- Gene.summarystats$SDlogexpression - predict(Gene.summarystats.lo, Gene.summarystats$logmean)
ggplot(Gene.summarystats, aes(x=logmean, y=SDlogexpression, color=SD.minus.loess)) +
  geom_point() +
  geom_smooth(method="loess", show_guide = FALSE, se=F) +
  xlab("Mean expression (log(RPKM)") +
  ylab("Standard deviation") +
  scale_colour_gradient2(name = "Residual from loess curve") +
  theme_bw() +
  theme(legend.position="bottom")

Alright that worked well, now do it for chimp, and again for human, and compare…

Get.GeneSummaryStat.df <- function(Input.df){
  My.Gene.summarystats <- data.frame(mean = apply(Input.df,1,mean),
                                  SD = sqrt(apply(Input.df,1,var)))
  My.Gene.summarystats.lo <- loess(SD ~ mean, My.Gene.summarystats)
  My.Gene.summarystats$SD.minus.loess <- My.Gene.summarystats$SD - predict(My.Gene.summarystats.lo, My.Gene.summarystats$mean)
  return(My.Gene.summarystats)
}


CountTable.chimp <- CountTable$E %>% as.data.frame() %>% dplyr::select(contains("C."))
Chimp.summarystats <- Get.GeneSummaryStat.df(CountTable.chimp)

CountTable.human <- CountTable$E %>% as.data.frame() %>% dplyr::select(contains("H."))
Human.summarystats <- Get.GeneSummaryStat.df(CountTable.human)

# How does this statistic (std dev minus loess prediction) correlate across species
R<-cor(Chimp.summarystats$SD.minus.loess, Human.summarystats$SD.minus.loess, method="pearson")
lb1 <- paste("~R^2==~", round(R**2,2))
qplot(Chimp.summarystats$SD.minus.loess, Human.summarystats$SD.minus.loess, alpha=0.05) +
  xlab("Chimp") +
  ylab("Human")+
  annotate("text",x=Inf,y=-Inf, label=lb1, hjust=1, vjust=-1, parse=TRUE) +
  theme_bw() +
  theme(legend.position = "none")

That correlation is surprisingly good. I should check that it isn’t a bug due to sample labels getting switched. Lets make correlation matrix of the original table to ensure that the chimp samples separate from the human samples.

cor(CountTable$E, method = c("spearman")) %>%
heatmap.2(trace="none")

Ok wow, I guess it really is the case that gene expression variance is very consistent. If there really isn’t a bug, and this statistic is measuring what it should, I expect it to correlate strongly to other measures of conservation like dN/dS or percent identity

ChimpToHumanGeneMap <- read.table("../data/Biomart_export.Hsap.Ptro.orthologs.txt.gz", header=T, sep='\t', stringsAsFactors = F) %>% distinct(Gene.stable.ID, .keep_all = T)


ToPlot <- Chimp.summarystats %>%
  rownames_to_column() %>%
  left_join(ChimpToHumanGeneMap, by=c("rowname" = "Gene.stable.ID")) %>%
  mutate(rank = dense_rank(dplyr::desc(SD.minus.loess))) %>% 
  mutate(group = case_when(
                           rank>=500 ~ "NotTop500VarianceGene",
                           rank<=500 ~ "Top500VarianceGene")) %>%
  mutate(dN.dS = dN.with.Chimpanzee/dS.with.Chimpanzee)

ggplot(ToPlot, aes(color=group, x=100-X.id..query.gene.identical.to.target.Chimpanzee.gene+0.001)) +
  stat_ecdf(geom = "step") +
  scale_x_continuous(trans='log1p', limits=c(0,100), expand=expand_scale()) +
  ylab("Cumulative frequency") +
  xlab("Percent non-identitical amino acid between chimp and human") +
  annotate("text", x = 10, y = 0.4, label = paste("Mann-Whitney\none-sided P =", signif(wilcox.test(data=ToPlot, X.id..query.gene.identical.to.target.Chimpanzee.gene ~ group, alternative="greater")$p.value, 2) )) +
  theme_bw() +
  theme(legend.position = c(.80, .2), legend.title=element_blank())

#What is overall correlation (previous analysis, where I compared two groups based on a threshold categorization of the other variable, is not the most sensitive method)
cor.test(ToPlot$X.id..query.gene.identical.to.target.Chimpanzee.gene, ToPlot$SD.minus.loess, method='spearman')

    Spearman's rank correlation rho

data:  ToPlot$X.id..query.gene.identical.to.target.Chimpanzee.gene and ToPlot$SD.minus.loess
S = 2.667e+11, p-value = 6.336e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
        rho 
-0.07554831 
ggplot(ToPlot, aes(y=SD.minus.loess, x=100-X.id..query.gene.identical.to.target.Chimpanzee.gene+0.001)) +
  geom_point() +
  scale_x_continuous(trans='log1p', limits=c(0,100), expand=expand_scale()) +
  ylab("SD.minus.loess") +
  xlab("Percent non-identitical amino acid between chimp and human") +
  theme_bw() +
  theme(legend.position = c(.80, .2), legend.title=element_blank())

ggplot(ToPlot, aes(color=group,x=dN.dS)) +
  stat_ecdf(geom = "step") +
  ylab("Cumulative frequency") +
  xlab("dN/dS") +
  scale_x_continuous(trans='log10', limits=c(0.01,10)) +
  annotate("text", x = 1, y = 0.4, label = paste("Mann-Whitney\none-sided P =", signif(wilcox.test(data=ToPlot, dN.dS ~ group, alternative="less")$p.value, 2) )) +
  theme_bw() +
  theme(legend.position = c(.80, .2), legend.title=element_blank())

Ok now subtract the variance metric (Chimp - human) to get a ordered list of genes where higher numbers means more variance in chimp.

RankedGeneList<-Chimp.summarystats$SD.minus.loess - Human.summarystats$SD.minus.loess
names(RankedGeneList) <- rownames(Chimp.summarystats)
SortedGeneList <- sort(RankedGeneList, decreasing=T)

GSEA analysis

#bp  
gsego.cc.chimp.high.var <- gseGO(gene = SortedGeneList,
                 OrgDb = org.Hs.eg.db,
                 keyType = 'ENSEMBL',
                 maxGSSize = 500,
                 ont = "BP",
                 nPerm = 100000)
A<-as.data.frame(gsego.cc.chimp.high.var)
bp2 <- simplify(gsego.cc.chimp.high.var, cutoff=0.7, by="p.adjust", select_fun=min)

gseaplot2(gsego.cc.chimp.high.var, geneSetID = c("GO:0002263", "GO:0061061", "GO:0072538", "GO:0000076"), title="ChimpVariance - HumanVariance", pvalue_table = TRUE)

dotplot(bp2, font.size=8, showCategory=10)

#all
gsego.all.chimp.high.var <- gseGO(gene = SortedGeneList,
                 OrgDb = org.Hs.eg.db,
                 keyType = 'ENSEMBL',
                 maxGSSize = 500,
                 ont = "ALL",
                 nPerm = 100000)
A<-as.data.frame(gsego.all.chimp.high.var)
A %>% as.data.frame() %>% dplyr::select(ONTOLOGY, ID, Description, enrichmentScore) %>% arrange(desc(abs(enrichmentScore))) %>% head(20)
   ONTOLOGY         ID
1        CC GO:0042101
2        BP GO:0045059
3        CC GO:0071682
4        CC GO:0042611
5        BP GO:0090036
6        BP GO:0000076
7        MF GO:0042287
8        BP GO:0033081
9        BP GO:0046641
10       BP GO:0045076
11       BP GO:0043368
12       MF GO:0022829
13       BP GO:0034356
14       BP GO:0032753
15       BP GO:0072538
16       MF GO:0042288
17       CC GO:0042613
18       BP GO:0032673
19       BP GO:0045061
20       BP GO:2000316
                                                  Description
1                                     T cell receptor complex
2                            positive thymic T cell selection
3                                     endocytic vesicle lumen
4                                         MHC protein complex
5                    regulation of protein kinase C signaling
6                                  DNA replication checkpoint
7                                         MHC protein binding
8              regulation of T cell differentiation in thymus
9      positive regulation of alpha-beta T cell proliferation
10           regulation of interleukin-2 biosynthetic process
11                                  positive T cell selection
12                                 wide pore channel activity
13 NAD biosynthesis via nicotinamide riboside salvage pathway
14            positive regulation of interleukin-4 production
15                           T-helper 17 type immune response
16                                MHC class I protein binding
17                               MHC class II protein complex
18                     regulation of interleukin-4 production
19                                    thymic T cell selection
20             regulation of T-helper 17 type immune response
   enrichmentScore
1        0.8748100
2        0.8312196
3        0.8080349
4        0.8049838
5        0.8021597
6       -0.8013554
7        0.7915898
8        0.7888677
9        0.7872037
10       0.7864955
11       0.7844327
12       0.7826594
13       0.7786233
14       0.7740712
15       0.7729389
16       0.7673605
17       0.7664554
18       0.7643274
19       0.7596905
20       0.7589618
A %>% as.data.frame() %>% dplyr::select(ONTOLOGY, ID, Description, enrichmentScore) %>% arrange(desc(abs(enrichmentScore))) %>% head(20)
   ONTOLOGY         ID
1        CC GO:0042101
2        BP GO:0045059
3        CC GO:0071682
4        CC GO:0042611
5        BP GO:0090036
6        BP GO:0000076
7        MF GO:0042287
8        BP GO:0033081
9        BP GO:0046641
10       BP GO:0045076
11       BP GO:0043368
12       MF GO:0022829
13       BP GO:0034356
14       BP GO:0032753
15       BP GO:0072538
16       MF GO:0042288
17       CC GO:0042613
18       BP GO:0032673
19       BP GO:0045061
20       BP GO:2000316
                                                  Description
1                                     T cell receptor complex
2                            positive thymic T cell selection
3                                     endocytic vesicle lumen
4                                         MHC protein complex
5                    regulation of protein kinase C signaling
6                                  DNA replication checkpoint
7                                         MHC protein binding
8              regulation of T cell differentiation in thymus
9      positive regulation of alpha-beta T cell proliferation
10           regulation of interleukin-2 biosynthetic process
11                                  positive T cell selection
12                                 wide pore channel activity
13 NAD biosynthesis via nicotinamide riboside salvage pathway
14            positive regulation of interleukin-4 production
15                           T-helper 17 type immune response
16                                MHC class I protein binding
17                               MHC class II protein complex
18                     regulation of interleukin-4 production
19                                    thymic T cell selection
20             regulation of T-helper 17 type immune response
   enrichmentScore
1        0.8748100
2        0.8312196
3        0.8080349
4        0.8049838
5        0.8021597
6       -0.8013554
7        0.7915898
8        0.7888677
9        0.7872037
10       0.7864955
11       0.7844327
12       0.7826594
13       0.7786233
14       0.7740712
15       0.7729389
16       0.7673605
17       0.7664554
18       0.7643274
19       0.7596905
20       0.7589618
# bp2 <- simplify(gsego.all.chimp.high.var, cutoff=0.7, by="p.adjust", select_fun=min)

# bp2 %>% as.data.frame() %>% dplyr::select(ONTOLOGY, ID, Description, enrichmentScore) %>% arrange(desc(abs(enrichmentScore))) %>% head(20)



table(A$ONTOLOGY)

 BP  CC  MF 
336  32  16 
A$ID
  [1] "GO:0032101" "GO:0030155" "GO:0002263" "GO:0001816" "GO:0045087"
  [6] "GO:0002366" "GO:0061061" "GO:0009986" "GO:0051046" "GO:0002274"
 [11] "GO:0019221" "GO:0050778" "GO:0006954" "GO:0031347" "GO:1903530"
 [16] "GO:0046649" "GO:0001817" "GO:0002444" "GO:0002275" "GO:0043299"
 [21] "GO:0036230" "GO:0042119" "GO:0050865" "GO:0002253" "GO:0009617"
 [26] "GO:0002694" "GO:0002521" "GO:0002764" "GO:0098552" "GO:0042110"
 [31] "GO:0002757" "GO:0051249" "GO:1903706" "GO:0022407" "GO:0098542"
 [36] "GO:0002697" "GO:0050727" "GO:0002250" "GO:0007159" "GO:0030098"
 [41] "GO:0002768" "GO:0050867" "GO:0050863" "GO:0002696" "GO:0002429"
 [46] "GO:1903037" "GO:0051251" "GO:1902105" "GO:0060326" "GO:0070661"
 [51] "GO:0022409" "GO:0034612" "GO:0009897" "GO:0030217" "GO:0002460"
 [56] "GO:0032943" "GO:0046651" "GO:0042113" "GO:0002449" "GO:1903039"
 [61] "GO:0050870" "GO:0070663" "GO:0002699" "GO:0032944" "GO:0050851"
 [66] "GO:0050670" "GO:0034341" "GO:1903708" "GO:0002703" "GO:0030595"
 [71] "GO:0051092" "GO:0071346" "GO:0045619" "GO:0050852" "GO:0002819"
 [76] "GO:1902107" "GO:0045580" "GO:0046631" "GO:0006959" "GO:0032946"
 [81] "GO:0050671" "GO:0001906" "GO:0046632" "GO:0060333" "GO:0045621"
 [86] "GO:0001909" "GO:0034340" "GO:0045582" "GO:0042102" "GO:0060337"
 [91] "GO:0071357" "GO:0033077" "GO:0003823" "GO:0045058" "GO:0032633"
 [96] "GO:0032673" "GO:0043368" "GO:0042287" "GO:0042101" "GO:0004888"
[101] "GO:0002446" "GO:0002283" "GO:0043312" "GO:0098797" "GO:0007052"
[106] "GO:0045785" "GO:0001819" "GO:0018212" "GO:0050900" "GO:0018108"
[111] "GO:0031349" "GO:0052547" "GO:0052548" "GO:0050730" "GO:0098802"
[116] "GO:0042742" "GO:0070665" "GO:0002706" "GO:0002705" "GO:0048525"
[121] "GO:0046637" "GO:0001910" "GO:0002287" "GO:0002293" "GO:0002294"
[126] "GO:0072376" "GO:0042093" "GO:0006956" "GO:0001784" "GO:0032743"
[131] "GO:0042611" "GO:0045088" "GO:0071356" "GO:0005539" "GO:0050792"
[136] "GO:0002822" "GO:0030317" "GO:0097722" "GO:0046634" "GO:0072562"
[141] "GO:0043367" "GO:0050853" "GO:0045622" "GO:0042692" "GO:1902850"
[146] "GO:0045089" "GO:0101002" "GO:0017157" "GO:0042129" "GO:0048514"
[151] "GO:0005509" "GO:0001525" "GO:0002286" "GO:0007517" "GO:0002683"
[156] "GO:0031341" "GO:0046635" "GO:0002292" "GO:0010469" "GO:0019730"
[161] "GO:0042098" "GO:0002526" "GO:0045445" "GO:0071621" "GO:1903901"
[166] "GO:0032663" "GO:0033081" "GO:0050729" "GO:0097530" "GO:0019724"
[171] "GO:0032103" "GO:0006027" "GO:0032609" "GO:0007051" "GO:0031294"
[176] "GO:0072538" "GO:0030545" "GO:0031012" "GO:0006026" "GO:0002532"
[181] "GO:0002709" "GO:0002285" "GO:0002708" "GO:0043903" "GO:0000076"
[186] "GO:0000819" "GO:0031295" "GO:0048018" "GO:0032753" "GO:0071222"
[191] "GO:0046641" "GO:0050866" "GO:0032623" "GO:0050864" "GO:0007059"
[196] "GO:0010035" "GO:0060205" "GO:0043235" "GO:0000075" "GO:0001568"
[201] "GO:0097529" "GO:0050731" "GO:0002695" "GO:0001818" "GO:0031983"
[206] "GO:0098813" "GO:0000070" "GO:0016064" "GO:0140014" "GO:0002820"
[211] "GO:0043044" "GO:0099003" "GO:0043408" "GO:0005819" "GO:0072539"
[216] "GO:0043370" "GO:0005874" "GO:0001912" "GO:0000793" "GO:0000280"
[221] "GO:0001772" "GO:1903900" "GO:0014706" "GO:0009435" "GO:0045059"
[226] "GO:0042035" "GO:0030162" "GO:1901222" "GO:0090036" "GO:0045061"
[231] "GO:0031343" "GO:0071219" "GO:0016079" "GO:2000116" "GO:0060538"
[236] "GO:0035710" "GO:0009636" "GO:0032496" "GO:0045071" "GO:0030139"
[241] "GO:0002823" "GO:0070670" "GO:0002673" "GO:0060537" "GO:0045309"
[246] "GO:0042330" "GO:0055001" "GO:0070820" "GO:0019955" "GO:0002790"
[251] "GO:0055002" "GO:0001539" "GO:0060285" "GO:0071556" "GO:0098553"
[256] "GO:1903305" "GO:0017156" "GO:1902749" "GO:0071353" "GO:0032735"
[261] "GO:0051250" "GO:0002707" "GO:0030101" "GO:2000514" "GO:0045076"
[266] "GO:0008037" "GO:0002237" "GO:0006935" "GO:0032649" "GO:0042094"
[271] "GO:0042108" "GO:0002228" "GO:0042089" "GO:0042107" "GO:0051301"
[276] "GO:0002931" "GO:0032613" "GO:0055069" "GO:0060047" "GO:0032653"
[281] "GO:0070372" "GO:0048489" "GO:0097480" "GO:0051146" "GO:0019674"
[286] "GO:0002455" "GO:0071682" "GO:0043901" "GO:0034356" "GO:0046633"
[291] "GO:0070371" "GO:0030449" "GO:2000257" "GO:0002920" "GO:1990266"
[296] "GO:0045576" "GO:0030593" "GO:1904724" "GO:0071216" "GO:0006336"
[301] "GO:2000316" "GO:0009611" "GO:0002824" "GO:0032733" "GO:0038061"
[306] "GO:0097479" "GO:0046638" "GO:0010942" "GO:1904813" "GO:0051303"
[311] "GO:0051310" "GO:2001233" "GO:0072525" "GO:0043410" "GO:0002821"
[316] "GO:0050000" "GO:0002456" "GO:0042288" "GO:0034724" "GO:0032692"
[321] "GO:0007186" "GO:0042267" "GO:0007080" "GO:0042613" "GO:0051091"
[326] "GO:0045121" "GO:0003015" "GO:0050830" "GO:0090307" "GO:0044818"
[331] "GO:0042100" "GO:0032691" "GO:0033628" "GO:0046596" "GO:0062023"
[336] "GO:0002702" "GO:0031570" "GO:0008016" "GO:1901342" "GO:0030414"
[341] "GO:0001501" "GO:0000302" "GO:0000779" "GO:1903539" "GO:0046640"
[346] "GO:0051090" "GO:0045505" "GO:0098589" "GO:0000226" "GO:0019319"
[351] "GO:0005796" "GO:0032418" "GO:0006094" "GO:0032729" "GO:0002704"
[356] "GO:1903522" "GO:0009306" "GO:0034080" "GO:0061641" "GO:0002886"
[361] "GO:0034728" "GO:0002758" "GO:0002793" "GO:0002711" "GO:0006882"
[366] "GO:0007093" "GO:0098857" "GO:0007017" "GO:0010466" "GO:0001892"
[371] "GO:0022829" "GO:0045335" "GO:0004715" "GO:0046364" "GO:0050663"
[376] "GO:1900015" "GO:0017171" "GO:0045064" "GO:0030183" "GO:0019359"
[381] "GO:0019363" "GO:0035821" "GO:0097237" "GO:0043062"

Keep in mind GSEA finds both enrichment at both top and bottom of the list. MHC complex is on here, with higher enrichment in chimp

CC.gsea <- gseGO(gene = SortedGeneList,
                 OrgDb = org.Hs.eg.db,
                 keyType = 'ENSEMBL',
                 maxGSSize = 500,
                 ont = "CC",
                 nPerm = 100000)

BP.gsea <- gseGO(gene = SortedGeneList,
                 OrgDb = org.Hs.eg.db,
                 keyType = 'ENSEMBL',
                 maxGSSize = 500,
                 ont = "BP",
                 nPerm = 1000000)

MF.gsea <- gseGO(gene = SortedGeneList,
                 OrgDb = org.Hs.eg.db,
                 keyType = 'ENSEMBL',
                 maxGSSize = 500,
                 ont = "MF",
                 nPerm = 100000)

CC.gsea.simplified <- as.data.frame(simplify(CC.gsea))
CC.gsea.simplified$OntologyCategory <- "Cellular.Component"
BP.gsea.simplified <- as.data.frame(simplify(BP.gsea))
BP.gsea.simplified$OntologyCategory <- "Biological.Process"
MF.gsea.simplified <- as.data.frame(simplify(MF.gsea))
MF.gsea.simplified$OntologyCategory <- "Molecular.Function"

Combined <- rbind(
      CC.gsea.simplified,
      BP.gsea.simplified,
      MF.gsea.simplified)

Combined %>% 
  group_by(OntologyCategory) %>%
  top_n(n = 5, wt = abs(enrichmentScore)) %>%
  ungroup() %>%
  ggplot(aes(x=enrichmentScore, y=Description, color=p.adjust, size=setSize)) +
  geom_point() +
  xlim(c(-1,1)) +
  facet_grid(OntologyCategory~., scales = "free") +
  scale_colour_gradient(low="red", high="black") +
  facet_grid(OntologyCategory~., scales = "free") +
  labs(color = "Adjusted P-value") +
  theme_bw()

Combined %>% 
  group_by(OntologyCategory) %>%
  top_n(n = -7, wt = qvalues) %>%
  top_n(n = 7, wt = setSize) %>%
  ungroup() %>%
  # group_by(OntologyCategory) %>%
  # sample_n(8) %>%
  ggplot(aes(x=enrichmentScore, y=Description, size=setSize)) +
  geom_point() +
  xlim(c(-1,1)) +
  facet_grid(OntologyCategory~., scales = "free") +
  # scale_colour_gradient(low="red", high="black") +
  # labs(color = "Adjusted P-value") +
  theme_bw()

In addition to GSEA analysis, it might be worthwile to also try to correlate difference in variance metric to chromatin-interaction score:

SampleA<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_A-21792_10kb_norm.gz"), sep='\t')
SampleB<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_B-28126_10kb_norm.gz"), sep='\t')
SampleC<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_C-3649_10kb_norm.gz"), sep='\t')
SampleD<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_D-40300_10kb_norm.gz"), sep='\t')
SampleE<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_E-28815_10kb_norm.gz"), sep='\t')
SampleF<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_F-28834_10kb_norm.gz"), sep='\t')
SampleG<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_G-3624_10kb_norm.gz"), sep='\t')
SampleH<- read.csv(gzfile("../data/IttaiHomerInteractionScoresInCisWindows/adj_bins_25_H-3651_10kb_norm.gz"), sep='\t')


ChimpToHumanGeneMap <- read.table("../data/Biomart_export.Hsap.Ptro.orthologs.txt.gz", header=T, sep='\t', stringsAsFactors = F)
kable(head(ChimpToHumanGeneMap))
Gene.stable.ID Transcript.stable.ID Chimpanzee.gene.stable.ID Chimpanzee.gene.name Chimpanzee.protein.or.transcript.stable.ID Chimpanzee.homology.type X.id..target.Chimpanzee.gene.identical.to.query.gene X.id..query.gene.identical.to.target.Chimpanzee.gene dN.with.Chimpanzee dS.with.Chimpanzee Chimpanzee.orthology.confidence..0.low..1.high.
ENSG00000198888 ENST00000361390 ENSPTRG00000042641 MT-ND1 ENSPTRP00000061407 ortholog_one2one 94.6541 94.6541 0.0267 0.5455 1
ENSG00000198763 ENST00000361453 ENSPTRG00000042626 MT-ND2 ENSPTRP00000061406 ortholog_one2one 96.2536 96.2536 0.0185 0.7225 1
ENSG00000210127 ENST00000387392 ENSPTRG00000042642 MT-TA ENSPTRT00000076396 ortholog_one2one 100.0000 100.0000 NA NA NA
ENSG00000198804 ENST00000361624 ENSPTRG00000042657 MT-CO1 ENSPTRP00000061408 ortholog_one2one 98.8304 98.8304 0.0065 0.5486 1
ENSG00000198712 ENST00000361739 ENSPTRG00000042660 MT-CO2 ENSPTRP00000061402 ortholog_one2one 97.7974 97.7974 0.0106 0.5943 1
ENSG00000228253 ENST00000361851 ENSPTRG00000042653 MT-ATP8 ENSPTRP00000061400 ortholog_one2one 94.1176 94.1176 0.0325 0.3331 1
# Of this ortholog list, how many genes are one2one
table(ChimpToHumanGeneMap$Chimpanzee.homology.type)

ortholog_many2many  ortholog_one2many   ortholog_one2one 
              2278              19917             140351 
OneToOneMap <- ChimpToHumanGeneMap %>%
  filter(Chimpanzee.homology.type=="ortholog_one2one")
ChimpToHuman.ID <- function(Chimp.ID){
  #function to convert chimp ensembl to human ensembl gene ids
  return(
    plyr::mapvalues(Chimp.ID, OneToOneMap$Chimpanzee.gene.stable.ID, OneToOneMap$Gene.stable.ID, warn_missing = F)
  )}

HumanInteractions <- data.frame(H.Score = base::rowSums(cbind(SampleA, SampleB, SampleE, SampleF))) %>%
  rownames_to_column() %>%
  mutate(HumanID = gsub("(.+?)\\..+?", "\\1", rowname, perl=T))


ChimpInteractions <- data.frame(C.Score = rowSums(cbind(SampleC, SampleD, SampleG, SampleH))) %>%
  rownames_to_column() %>%
  mutate(HumanID = ChimpToHuman.ID(rowname))


ToPlot <- data.frame(SpeciesVarianceDiff = RankedGeneList) %>%
  rownames_to_column() %>%
  left_join(HumanInteractions, by=c("rowname"="HumanID")) %>%
  left_join(ChimpInteractions, by=c("rowname"="HumanID")) %>% 
  mutate(InteractionDifference=H.Score - C.Score) %>%
  filter(!is.na(H.Score)) %>%
  filter(!is.na(C.Score))

ggplot(ToPlot, aes(x=SpeciesVarianceDiff, y=InteractionDifference)) +
  geom_point() +
  theme_bw() +
  xlab("Variation in expression\nTighter in human <--  --> Tighter in chimp") +
  ylab("Differential contacts in cis window\nMore in human <--  --> More in chimp") +
  geom_smooth(method='lm',formula=y~x)

cor.test(x=ToPlot$SpeciesVarianceDiff, y=ToPlot$InteractionDifference, method="spearman")

    Spearman's rank correlation rho

data:  ToPlot$SpeciesVarianceDiff and ToPlot$InteractionDifference
S = 5.4814e+10, p-value = 0.811
alternative hypothesis: true rho is not equal to 0
sample estimates:
         rho 
-0.002880632 
contacts.v.eGene.lm = lm(InteractionDifference ~ SpeciesVarianceDiff, data=ToPlot)
summary(contacts.v.eGene.lm)

Call:
lm(formula = InteractionDifference ~ SpeciesVarianceDiff, data = ToPlot)

Residuals:
    Min      1Q  Median      3Q     Max 
-207.59  -29.71   -2.91   28.14  336.83 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           5.3982     0.5857   9.217   <2e-16 ***
SpeciesVarianceDiff  -2.7851     3.0147  -0.924    0.356    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 48.63 on 6894 degrees of freedom
Multiple R-squared:  0.0001238, Adjusted R-squared:  -2.125e-05 
F-statistic: 0.8535 on 1 and 6894 DF,  p-value: 0.3556
plot(contacts.v.eGene.lm)

Not significant. Perhaps most intuitive explanation is to why this relationship was significant with eGenes (species difference in neighborhood chromatin contacts partly explains species difference in cis-eGene rank) but not for variance (species difference in neghborhood chromatin contacts does not significantly explain any difference in within- species variance) is that the chromatin contacts only mediate cis-variance, while 80% of expression variance is in trans.


sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] enrichplot_1.2.0       org.Hs.eg.db_3.7.0     AnnotationDbi_1.44.0  
 [4] IRanges_2.16.0         S4Vectors_0.20.1       Biobase_2.42.0        
 [7] BiocGenerics_0.28.0    clusterProfiler_3.10.1 gplots_3.0.1.1        
[10] corrplot_0.84          edgeR_3.24.3           limma_3.38.3          
[13] knitr_1.23             forcats_0.4.0          stringr_1.4.0         
[16] dplyr_0.8.1            purrr_0.3.2            readr_1.3.1           
[19] tidyr_0.8.3            tibble_2.1.3           ggplot2_3.1.1         
[22] tidyverse_1.2.1       

loaded via a namespace (and not attached):
 [1] fgsea_1.8.0         colorspace_1.4-1    ggridges_0.5.1     
 [4] rprojroot_1.3-2     qvalue_2.14.1       fs_1.3.1           
 [7] rstudioapi_0.10     farver_1.1.0        urltools_1.7.3     
[10] ggrepel_0.8.1       bit64_0.9-7         lubridate_1.7.4    
[13] xml2_1.2.0          splines_3.5.1       GOSemSim_2.8.0     
[16] polyclip_1.10-0     jsonlite_1.6        workflowr_1.4.0    
[19] broom_0.5.2         GO.db_3.7.0         ggforce_0.2.2      
[22] compiler_3.5.1      httr_1.4.0          rvcheck_0.1.3      
[25] backports_1.1.4     assertthat_0.2.1    Matrix_1.2-17      
[28] lazyeval_0.2.2      cli_1.1.0           tweenr_1.0.1       
[31] htmltools_0.3.6     prettyunits_1.0.2   tools_3.5.1        
[34] igraph_1.2.4.1      gtable_0.3.0        glue_1.3.1         
[37] reshape2_1.4.3      DO.db_2.9           fastmatch_1.1-0    
[40] Rcpp_1.0.1          cellranger_1.1.0    gdata_2.18.0       
[43] nlme_3.1-140        ggraph_1.0.2        xfun_0.7           
[46] rvest_0.3.4         gtools_3.8.1        DOSE_3.8.2         
[49] europepmc_0.3       MASS_7.3-51.4       scales_1.0.0       
[52] hms_0.4.2           RColorBrewer_1.1-2  yaml_2.2.0         
[55] memoise_1.1.0       gridExtra_2.3       UpSetR_1.4.0       
[58] triebeard_0.3.0     stringi_1.4.3       RSQLite_2.1.1      
[61] highr_0.8           caTools_1.17.1.2    BiocParallel_1.16.6
[64] rlang_0.3.4         pkgconfig_2.0.2     bitops_1.0-6       
[67] evaluate_0.14       lattice_0.20-38     labeling_0.3       
[70] cowplot_0.9.4       bit_1.1-14          tidyselect_0.2.5   
[73] plyr_1.8.4          magrittr_1.5        R6_2.4.0           
[76] generics_0.0.2      DBI_1.0.0           pillar_1.4.1       
[79] haven_2.1.0         whisker_0.3-2       withr_2.1.2        
[82] modelr_0.1.4        crayon_1.3.4        KernSmooth_2.23-15 
[85] rmarkdown_1.13      viridis_0.5.1       progress_1.2.2     
[88] locfit_1.5-9.1      grid_3.5.1          readxl_1.3.1       
[91] data.table_1.12.2   blob_1.1.1          git2r_0.25.2       
[94] digest_0.6.19       gridGraphics_0.4-1  munsell_0.5.0      
[97] viridisLite_0.3.0   ggplotify_0.0.3