Last updated: 2020-09-09

Checks: 6 1

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    WorkingManuscript.zip
    Ignored:    WorkingManuscript/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis_temp/.DS_Store
    Ignored:    big_data/
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    code/snakemake_workflow/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    figures/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  analysis/20200907_Response_OriginalComments.Rmd
    Untracked:  analysis/20200907_Response_Point_11.Rmd

Unstaged changes:
    Modified:   analysis_temp/Ted_vignette.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


Original reviewer point:

Have the authors tried estimating dispersion on top of what is expected based on differences in cell type? There are several strategies that might work for this: There are new strategies for estimating a posterior of cell type specific expression from a bulk sample, conditional on scRNA-seq data as prior information (Chu and Danko, bioRxiv, 2020). These cell type specific expression estimates could then be analyzed for dispersion. Alternatively, it may also work to regress the estimated proportion of each cell type out of the dispersion estimates. While there are certainly a lot of pitfalls with using these strategies, especially in the setting shown here (all of this would work better if there were species matched reference data), they might provide an avenue for depleting the contribution of cell type differences from dispersion estimates.

Here I will use the Chu and Danko methodology (“TED”, or “BayesPrism”, see https://github.com/Danko-Lab/TED) to estimate cell type specific expression estimates for each individual, and then estimate a cell type specific dispersion estimate. As the reviewer notes, there a lot of pitfalls with this strategy. Really, in the absence of scRNA-seq data (and a human scRNA-seq reference), there is only so much we can gain I think. But still worth a try.

library(TED)
library(tidyverse)
library(biomaRt)
library(Seurat)
library(gplots)
source("../code/CustomFunctions.R")
library(knitr)
library(scales)

This block references a file that I keep on my local computer, that is not reproducible from the snakemake. This rds file references a seurat object I created from publicaly available mouse scRNA-seq data (Tabula Muris) using this Rscript.

Heart.seur = readRDS("../big_data/TabMuris_heart_seurat.rds")

Some other data to read in… This includes the bulk RNA-seq count data from this study (39 human, 39 chimp), and some other necessary data to convert mouse gene names to human, etc.

#read in table of overdispersion, dispersion, mean expresion estimates
Dispersion <- read.delim('../output/OverdispersionEstimatesFromChimp.txt')

Dispersion$MeanDispersion <- (Dispersion$Chimp.Residual + Dispersion$Human.Residual)/2

CountTableChimpFile <- '../output/PowerAnalysisFullCountTable.Chimp.subread.txt.gz'
CountTableHumanFile <- '../output/PowerAnalysisFullCountTable.Human.subread.txt.gz'
OutputDE <- '../output/Final/TableS2.tab'
DropFileName <- '../data/DE_SamplesToDrop.txt'

DropFile <- read.delim(DropFileName, sep='\t', col.names = c("Sample", "Species"), stringsAsFactors = F)
HumanSamplesToDrop <- DropFile %>% filter(Species=="Human") %>% pull(Sample)
ChimpSamplesToDrop <- DropFile %>% filter(Species=="Chimp") %>% pull(Sample)
DE.results <- read.delim(OutputDE, sep='\t', stringsAsFactors = F)
GeneListForOverdispersionCalculation <- DE.results$Ensembl_geneID

CountTables <- GetCountTables(CountTableChimpFile,
                              CountTableHumanFile,
                              0, GeneListForOverdispersionCalculation, ChimpSampleDrop=ChimpSamplesToDrop, HumanSampleDrop = HumanSamplesToDrop)

human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")

MouseGenes = rownames(Heart.seur)
genes = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol", values = MouseGenes ,mart = mouse, attributesL = c("ensembl_gene_id", "chromosome_name", "start_position", "mmusculus_homolog_orthology_type"), martL = human, uniqueRows=T)

# Get list of one to one orthologs in both tabula muris and our dataset for further analysis
one2one_HumanMouseOrthologs <- genes %>%
  filter(Mouse.homology.type=="ortholog_one2one") %>%
  filter(Gene.stable.ID %in% Dispersion$gene) %>%
  dplyr::select(MGI.symbol, Gene.stable.ID) %>%
  distinct(MGI.symbol, .keep_all = T)

Now prepare input data from run.Ted function:

CellTypes <- data.frame(Cell.ID=colnames(Heart.seur@assays$RNA), CellType=Heart.seur@meta.data$Cell.ontology.class)

scCountTable.Filtered <- Heart.seur@assays$RNA[one2one_HumanMouseOrthologs$MGI.symbol,] %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  inner_join(one2one_HumanMouseOrthologs, by=c("rowname"="MGI.symbol")) %>%
  dplyr::select(-rowname) %>%
  column_to_rownames("Gene.stable.ID") %>% t() %>%
  as.data.frame() %>%
  rownames_to_column("Cell.ID") %>%
  inner_join(CellTypes, by="Cell.ID") %>%
  filter(!CellType == "unassigned cell type") %>%
  dplyr::select(-CellType) %>%
  column_to_rownames("Cell.ID") %>% t()
  
scCountTable.Filtered[1:10,1:10] %>% kable()
A1.B000412.3_56_F.1.1 A1.B000633.3_56_F.1.1 A1.B002423.3_39_F.1.1 A1.B002427.3_39_F.1.1 A1.B002428.3_38_F.1.1 A1.MAA000398.3_9_M.1.1 A1.MAA000400.3_8_M.1.1 A1.MAA000898.3_11_M.1.1 A1.MAA000899.3_10_M.1.1 A1.MAA000903.3_11_M.1.1
ENSG00000265491 295 0 516 0 0 0 59 0 135 0
ENSG00000265972 676 555 0 1 82 414 0 1068 299 467
ENSG00000271601 0 0 0 0 0 0 0 0 0 0
ENSG00000272031 0 0 0 0 0 0 0 0 0 0
ENSG00000134250 0 48 181 1 36 12 269 583 12 828
ENSG00000131788 0 0 0 0 0 0 0 13 14 0
ENSG00000186141 0 0 0 0 0 0 0 0 0 305
ENSG00000186364 0 0 0 0 0 0 0 0 0 0
ENSG00000143127 0 0 0 0 0 0 0 0 0 0
ENSG00000198483 0 0 0 0 0 0 0 137 0 0
CellTypeVector <- CellTypes %>%
  filter(!CellType == "unassigned cell type") %>% pull(CellType) %>% as.character()
head(CellTypeVector)
[1] "endothelial cell"   "leukocyte"          "fibroblast"        
[4] "myofibroblast cell" "endocardial cell"   "fibroblast"        
Bulk.CountTable <- cbind(CountTables$Chimp$Counts, CountTables$Human$Counts) %>%
  rownames_to_column() %>%
  inner_join(one2one_HumanMouseOrthologs, by=c("rowname"="Gene.stable.ID")) %>%
  dplyr::select(-MGI.symbol) %>%
  column_to_rownames()
Bulk.CountTable[1:10,1:10] %>% kable()
C.4x0519 C.4x373 C.476 C.4x0025 C.438 C.462 C.4X0354 C.503 C.317 C.623
ENSG00000186891 46 73 20 28 4 11 98 13 11 3
ENSG00000186827 274 393 313 397 173 128 975 236 307 84
ENSG00000078808 5967 1931 7637 3892 2957 3745 1189 5857 4306 3704
ENSG00000176022 1097 402 1574 1333 955 1235 416 1087 745 640
ENSG00000160087 761 283 1153 776 471 610 275 920 547 604
ENSG00000131584 2587 1537 3646 2546 1755 1323 4585 2250 1484 1872
ENSG00000169972 282 78 396 257 209 282 111 243 141 252
ENSG00000169962 320 111 366 80 178 145 269 416 237 107
ENSG00000107404 7019 3600 11211 7397 5731 4670 3942 9779 5566 6768
ENSG00000162576 3701 1059 1384 2114 1426 2418 1242 1360 2221 1948

Now, use run.Ted function to estimate cell type porportions and also per individual expression within each cell type. In this code block, which I have run before knitting the Rmarkdown, I will also save the results to an RDS file that I will read in the next code block, so I don’t have to re-run this time consuming function when I want to knit the Rmarkdown.

Ted.results <- run.Ted(t(scCountTable.Filtered), t(Bulk.CountTable), pheno.labels = CellTypeVector, input.type = "scRNA", n.cores = 1)
saveRDS(Ted.results, file = "../big_data/TedResults.rds")

Now read in the ted results.

Ted.results <- readRDS("../big_data/TedResults.rds")
Ted.Expression.Per.CellType <- Ted.results$res$first.gibbs.res$Znkg #mean reads in each cell type
Ted.CellTypePorportions <- Ted.results$res$first.gibbs.res$theta.merged #cell fraction

Ok, now let’s compare the cell type porportions to CIBERSORT results…

#Ted cell type porportions table
Ted.CellTypePorportions %>% head() %>% kable()
endothelial cell leukocyte fibroblast myofibroblast cell endocardial cell cardiac muscle cell professional antigen presenting cell erythrocyte smooth muscle cell
C.4x0519 0.0655604 0.0015653 0.0279680 0.0863701 0.0335372 0.5969150 0.0281230 0.1059831 0.0539779
C.4x373 0.0729209 0.0000807 0.0353951 0.0581507 0.0102430 0.5956729 0.0252973 0.0969409 0.1052985
C.476 0.0443883 0.0000164 0.0007314 0.0427403 0.0154651 0.7444661 0.0230774 0.0774483 0.0516668
C.4x0025 0.0840676 0.0002683 0.0140482 0.1144108 0.0322334 0.5566394 0.0434807 0.1087926 0.0460590
C.438 0.0438903 0.0000171 0.0061195 0.0420949 0.0162850 0.7467382 0.0235200 0.0658731 0.0554619
C.462 0.0661200 0.0005401 0.0194387 0.0645938 0.0215091 0.6691649 0.0268254 0.0845318 0.0472763
Ted.CellTypePorportions.df <- Ted.CellTypePorportions %>% as.data.frame() %>%
  rownames_to_column("Input.Sample") %>%
  select_all(~gsub("\\s+", ".", .)) %>%
  arrange(cardiac.muscle.cell) %>%
  gather(key="cell.type", value="percent", -Input.Sample) %>%
  mutate(Species = case_when(str_detect(Input.Sample, "C.") ~ "Chimp",
                   str_detect(Input.Sample, "H.") ~ "Human"))

#Ted cell type porportions as plot
ggplot(Ted.CellTypePorportions.df, aes(x=reorder(Input.Sample, percent, FUN=max), y=percent, fill=cell.type)) +
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(-0.001,1.001), expand = c(0, 0)) +
  facet_grid(~Species, scales="free_x", space="free_x") +
  theme_bw() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

#CIBERSORT cell type porportions
CIBERSORT <- read.table("../data/CIBERSORT.Output_Job2.csv", sep=',', header=T) %>%
  arrange(cardiac.muscle.cell) %>%
  gather(key="cell.type", value="percent", -Input.Sample, -P.value, -Pearson.Correlation, -RMSE) %>%
  mutate(Species = case_when(str_detect(Input.Sample, "C.") ~ "Chimp",
                   str_detect(Input.Sample, "H.") ~ "Human"))

#As a plot...
ggplot(CIBERSORT, aes(x=reorder(Input.Sample, percent, FUN=max), y=percent, fill=cell.type)) +
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(-0.001,1.001), expand = c(0, 0)) +
  facet_grid(~Species, scales="free_x", space="free_x") +
  theme_bw() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

#Compare CIBERSORT and TED cell porportions
inner_join(CIBERSORT, Ted.CellTypePorportions.df, by=c("Input.Sample", "cell.type"), suffix=c(".CIBERSORT", ".TED")) %>%
  ggplot(aes(x=percent.TED, y=percent.CIBERSORT, color=cell.type)) +
  geom_point(aes(shape=Species.TED)) +
  theme_bw()

inner_join(CIBERSORT, Ted.CellTypePorportions.df, by=c("Input.Sample", "cell.type"), suffix=c(".CIBERSORT", ".TED")) %>%
  ggplot(aes(x=percent.TED, y=percent.CIBERSORT, color=cell.type)) +
  geom_point(aes(shape=Species.TED)) +
  facet_wrap(~cell.type, scales = "free") +
  theme_bw()

Ok, some reasonable agreement between the cell type porportions for some cell types, but not for others (particularly the less prevalent cell types).

Ok, now let’s explore TED’s output for expression per cell type per individual, and consider estimating dispersion in a cell type specific fashion in a way similar to the bulk dispersion estimates.

Ted.Expression.Per.CellType %>% dim()
[1]    79     9 10001
dimnames(Ted.Expression.Per.CellType)[[1]] <- rownames(Ted.results$res$first.gibbs.res$theta.merged)
dimnames(Ted.Expression.Per.CellType)[[2]] <- colnames(Ted.results$res$first.gibbs.res$theta.merged)
dimnames(Ted.Expression.Per.CellType)[[3]] <- colnames(Ted.results$res$first.gibbs.res$Zkg)

Ted.Expression.Per.CellType.df <- Ted.Expression.Per.CellType %>%
  as.data.frame() %>% t() %>% as.data.frame() %>%
  rownames_to_column() %>%
  separate(rowname, into=c("CellType", "gene"), sep="\\.")

Ted.Expression.Per.CellType.df %>% head() %>% kable()
CellType gene C.4x0519 C.4x373 C.476 C.4x0025 C.438 C.462 C.4X0354 C.503 C.317 C.623 C.554 C.4X0095 C.549 C.4X0267 C.4X0339 C.529 C.495 C.338 C.4X0550 C.558 C.4x523 C.389 C.4x0043 C.4X0357 C.88A020 C.554_2 C.Little_R C.676 C.456 C.537 C.MD_And C.724 C.4x0430 C.95A014 C.4X0333 C.522 C.295 C.4X0212 C.570 H.SRR1477033 H.SRR598148 H.SRR601645 H.SRR599086 H.SRR604230 H.SRR1478189 H.SRR613186 H.59263 H.62905 H.63060 H.SRR600852 H.62606 H.SRR604174 H.SRR608096 H.SRR1489693 H.62765 H.61317 H.59167 H.SRR602106 H.SRR601868 H.SRR612335 H.SRR598589 H.SRR600474 H.SRR603449 H.SRR606939 H.SRR1481012 H.SRR599249 H.59365 H.SRR607313 H.63145 H.SRR1477569 H.SRR601239 H.SRR607252 H.SRR612875 H.SRR601613 H.SRR614683 H.SRR604122 H.59511 H.62850 H.SRR599380
endothelial cell ENSG00000271601 33.62 14.13 42.24 26.24 13.91 20.09 9.18 20.97 21.35 18.85 19.36 24.77 20.95 26.77 5.61 11.77 27.47 34.49 21.10 12.77 17.93 21.59 16.31 10.93 13.42 20.70 15.22 20.33 13.27 13.56 15.65 37.65 23.02 19.14 31.34 10.73 45.42 22.09 24.04 11.76 19.25 19.48 19.62 47.66 9.97 3.02 31.37 12.16 13.86 10.79 6.76 26.20 12.44 25.66 11.75 7.45 9.67 13.84 12.64 22.52 30.10 13.48 9.83 31.99 13.10 35.29 21.13 30.34 17.91 13.35 18.78 11.34 16.69 16.05 14.90 46.83 10.28 20.34 34.97
leukocyte ENSG00000271601 0.21 0.00 0.00 0.03 0.00 0.07 0.00 0.00 0.00 0.17 0.00 0.04 0.04 0.02 0.00 0.11 0.24 1.42 0.00 0.02 0.60 0.01 0.01 0.00 0.11 0.00 0.00 0.09 0.04 0.02 0.01 0.48 1.04 0.01 0.00 0.00 0.01 0.13 0.07 0.07 0.54 0.17 0.12 1.00 0.03 0.00 1.90 0.14 0.16 0.08 0.09 0.19 0.09 0.01 0.20 0.01 0.05 0.04 0.03 1.03 0.66 0.02 0.01 0.25 0.02 0.09 0.17 0.37 0.05 0.00 0.19 0.02 0.11 0.07 0.07 0.78 0.00 0.27 0.24
fibroblast ENSG00000271601 50.30 23.94 2.19 16.15 7.13 20.59 0.45 1.35 30.27 15.82 0.76 6.19 78.02 24.31 1.84 24.80 29.23 21.01 12.99 16.75 55.92 17.19 21.82 2.27 3.12 9.69 27.92 6.36 7.51 43.36 5.70 6.14 7.56 6.27 31.85 13.02 23.51 8.98 12.03 4.70 21.01 6.99 10.55 13.38 1.20 6.57 65.89 35.78 16.15 2.48 17.04 35.84 53.90 16.39 4.36 20.49 21.04 4.73 4.74 48.90 13.27 9.56 2.13 4.22 7.88 13.63 41.32 14.15 13.29 6.61 0.58 12.23 23.51 19.30 2.60 14.30 15.60 11.97 2.09
myofibroblast cell ENSG00000271601 116.55 30.33 108.13 97.52 36.27 53.13 13.36 50.35 63.34 49.61 53.00 59.96 58.25 64.29 9.13 27.65 52.42 78.03 72.87 24.29 73.06 63.32 43.24 21.50 44.20 55.98 15.05 64.04 45.34 26.86 31.00 76.45 61.39 53.24 91.24 35.34 92.34 47.52 71.77 42.96 51.82 56.50 45.18 165.63 29.82 6.36 146.65 32.01 44.51 36.22 27.54 142.43 67.65 46.80 73.26 21.68 21.12 38.41 41.20 127.33 112.89 43.43 33.31 92.20 62.67 89.46 101.15 82.89 50.59 38.47 55.65 73.29 55.28 65.74 45.63 137.46 26.61 76.28 93.03
endocardial cell ENSG00000271601 31.16 3.92 26.70 19.68 9.65 11.95 5.28 12.36 14.82 16.35 13.50 17.46 25.39 12.91 2.49 3.31 15.32 26.75 19.81 6.29 19.80 17.07 17.65 6.72 15.93 10.08 4.63 18.06 16.22 8.29 18.26 26.58 18.12 16.24 18.83 10.58 23.71 18.37 16.25 10.09 12.83 18.87 9.55 39.83 6.53 1.55 34.51 6.55 10.47 6.08 4.97 21.64 6.53 13.75 12.44 5.57 7.05 7.91 9.45 25.48 22.62 9.17 5.91 24.50 12.93 12.54 27.18 23.70 14.48 8.45 15.78 11.76 11.34 14.87 8.90 33.58 7.22 17.13 22.41
cardiac muscle cell ENSG00000271601 87.27 32.39 202.95 51.15 67.67 58.45 6.95 115.05 55.18 58.21 53.82 68.82 58.48 89.59 21.19 10.95 13.14 71.15 88.51 9.18 55.66 111.10 53.40 36.71 70.14 59.25 1.78 43.86 77.88 3.11 51.23 91.64 61.93 52.74 66.59 63.03 157.06 72.20 39.53 53.53 23.89 57.67 29.88 92.89 36.59 1.37 47.74 42.10 40.69 35.90 28.13 66.33 22.43 13.40 59.27 10.76 4.75 26.57 53.95 40.19 83.37 40.37 31.79 96.59 59.24 87.82 58.41 83.78 43.75 41.68 57.98 43.50 37.62 48.77 39.06 79.05 20.32 79.57 84.64

Are these expression values? Are they on a log scale? I should read some documentation, but let’s also just look at a histogram. I expect gene expression across different genes in an individual to appear roughly bell shaped on a log-scale.

hist(Ted.Expression.Per.CellType.df$C.4x0519)

hist(log10(Ted.Expression.Per.CellType.df$C.4x0519))

Yeah, so I think these are un-log transformed expression values. Are they normalized for sequencing depth per individual per cell type? Let’s check the median gene expression level for each individual x cell type combination… If the expression values are already normalized, I expect the median gene expression level to be roughly similar (at least, more similar than the >4 fold range in sequencing depth across samples)

Ted.Expression.Per.CellType.df %>%
  dplyr:: select(CellType, gene, starts_with("C.")) %>%
  gather("Ind", "Expression", -CellType, -gene) %>%
  group_by(Ind, CellType) %>%
  summarise(Med=median(Expression, na.rm=T)) %>%
  ggplot(aes(y=Med, x=Ind, fill=CellType)) +
  geom_col() +
  facet_wrap(~CellType) +
  theme_bw() +
  theme(
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Ok, things may not be normalized across cell types, but if I want to estimate dispersion on a per-cell type basis that may not matter… Let’s see if the median counts are related to total sequencing depth…

#Total gene mapping read counts from bulk count table
TotalBulkCount.df <- data.frame(TotalBulkCount = cbind(CountTables$Chimp$Counts, CountTables$Human$Counts) %>% colSums()) %>%
  rownames_to_column("Ind")
Ted.Expression.Per.CellType.df %>%
  dplyr:: select(CellType, gene, starts_with("C.")) %>%
  gather("Ind", "Expression", -CellType, -gene) %>%
  group_by(Ind, CellType) %>%
  summarise(Med=median(Expression, na.rm=T)) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  ggplot(aes(x=Med, y=TotalBulkCount, color=CellType)) +
  geom_point() +
  facet_wrap(~CellType,  scales = "free") +
  theme_bw()

Ok, since total counts (columnSums) from the bulk matrix strongly correlates with the median gene expression measurements output by TED for most cell types, I conclude that this TED output is estimates of the original count table, (though, obviously not integer counts). So perhaps the most reasonable way to estimate dispersion per cell type from this TED output is to first transform the TED expression matrices to log(CPM), and, similarly to the dispersion estimation procedure in the original manuscript, fit a loess curve to the trend between the population mean expression and log(variance).

Though I have an intuition that \(log(variance(logCPM))\) is analogous to the \(log(\phi)\) that I used to estimate dispersion, I can’t derive mathematically why \(log(variance(logCPM))\)) is the appropriate metric (as opposed to \(sd(logCPM)\) for example). But, I can show empirically show that \(log(variance(logCPM))\) is basically equivalent to the \(log(\phi)\) parameter I estimated from fitting a negative binomial to the bulk raw count data:

#Compare the overdispersion (phi) estimate to the log(variance(logCPM)).

CountTables$Chimp$Counts %>%
  rownames_to_column("gene") %>%
  gather("Ind", "Expression", -gene) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  mutate(Log.CPM.Expression=log(Expression/TotalBulkCount*1E6)) %>%
  group_by(gene) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var.logcpm=log(var(Log.CPM.Expression))) %>%
  left_join(Dispersion, by="gene") %>%
  # head() %>%
  ggplot(aes(x=log.var.logcpm, y=log(Chimp.Overdispersion))) +
    geom_point()

#in contrast to the sd(logCPM)...
CountTables$Chimp$Counts %>%
  rownames_to_column("gene") %>%
  gather("Ind", "Expression", -gene) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  mutate(Log.CPM.Expression=log(Expression/TotalBulkCount*1E6)) %>%
  group_by(gene) %>%
  summarise(mu=mean(Log.CPM.Expression), sd.logcpm=sd(Log.CPM.Expression)) %>%
  left_join(Dispersion, by="gene") %>%
  # head() %>%
  ggplot(aes(x=sd.logcpm, y=log(Chimp.Overdispersion))) +
    geom_point()

Ok, now let’s look at the loess plots and think about estimating dispersion on a per cell type (and per species) basis.

#loess log(var) vs mean for chimp by cell type
Normalized.Expression.Per.CellType <- Ted.Expression.Per.CellType.df %>%
  gather("Ind", "Expression", -CellType, -gene) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  mutate(Log.CPM.Expression=log(Expression/TotalBulkCount*1E6))

#loess fit, just for chimp population
Normalized.Expression.Per.CellType %>%
  filter(startsWith(Ind, "C")) %>%
  group_by(gene, CellType) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var=log(var(Log.CPM.Expression))) %>%
  ggplot(aes(x=mu, y=log.var, color=CellType)) +
  geom_point(alpha=0.05) +
  geom_smooth(method="loess", method.args=list(degree=1)) +
  theme_bw()

#...for both chimp and human
Normalized.Expression.Per.CellType %>%
  mutate(Species = case_when(str_detect(Ind, "C.") ~ "Chimp",
                   str_detect(Ind, "H.") ~ "Human")) %>%
  group_by(gene, CellType, Species) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var=log(var(Log.CPM.Expression))) %>%
  ggplot(aes(x=mu, y=log.var, color=CellType)) +
  geom_point(alpha=0.03) +
  geom_smooth(method="loess", method.args=list(degree=1), aes(linetype=Species), color="black") +
  facet_wrap(~CellType) +
  theme_bw()

It seems most trends are similar between species for different cell types, except leukocytes which I may drop from further analysis. The fact thiat this leukocyte cell type is an outlier may be biologically meaningful, especially given that ~4 of the chimp samples have been challenged with virus. But it might also be technical, I would rather just avoid that cell type.

So now I will calculate the residual from each loess fit as the cell type specific dispersion. Similar to the bulk case, I may use bootstrapping to estimate standard error and perform inference to compare the human vs chimp dispersion estimates.

Let’s start first just by getting dispersion estimates, and worry about bootstrapping and inference later.

DataToEstimateCellTypeDispersion <- Normalized.Expression.Per.CellType %>%
  mutate(Species = case_when(str_detect(Ind, "C.") ~ "Chimp",
                   str_detect(Ind, "H.") ~ "Human")) %>%
  group_by(gene, CellType, Species) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var=log(var(Log.CPM.Expression)))

#Estimate dispersion (residual from the loess fit)
CellTypeDispersion <- DataToEstimateCellTypeDispersion %>%
  group_by(Species, CellType) %>% 
    do(data.frame(., resid = residuals(loess(log.var ~ mu, data=., degree=1, na.action="na.exclude"))))

#Plot kernel density estimate of dispersion... One quick way to verify that I correctly obtained the residual of per-species and per-cell-type loess fits, is to look at the leukoctye in chimp and humans. If I did this correctly, I expect the dispersion estimates for every species x cell type combination to be roughly centered around 0, even the leukocytes.
CellTypeDispersion %>%
  ggplot(aes(x=resid, color=Species)) +
  geom_density() +
  facet_wrap(~CellType) +
  theme_bw()

Ok, it seems I have dispersion estimates for each cell type. How to present this in an interpretable way. How about I start by plotting a correlation matrix with hiearchal clustering dendrogram, comparing the dispersion estimates from bulk to those for each cell type, for both species.

Dispersion.Cor.Matrix <- Dispersion %>%
  dplyr::select(gene, Chimp_Bulk=Chimp.Dispersion, Human_Bulk=Human.Dispersion) %>%
  gather(key="Species.CellType", value="dispersion", -gene) %>%
  bind_rows(
    unite(CellTypeDispersion, Species.CellType, Species, CellType) %>%
      dplyr::select(gene, Species.CellType, dispersion=resid)) %>%
  dplyr::select(gene, Species.CellType, dispersion) %>%
  filter(gene %in% one2one_HumanMouseOrthologs$Gene.stable.ID) %>%
  pivot_wider(names_from=Species.CellType, values_from = dispersion) %>%
  column_to_rownames("gene") %>%
  as.matrix() %>%
  cor(use="pairwise.complete.obs")

SpeciesFactorCol <- colnames(Dispersion.Cor.Matrix) %>% substr(1,1) %>% factor() %>% unclass() %>% as.character() %>% recode("1"="#F8766D", "2"="#00BFC4")

TissueFactorCol <- data.frame(ColNames=colnames(Dispersion.Cor.Matrix)) %>%
  mutate(Tissue=str_remove(ColNames, "^.+?_"))
TissueColors.df <- data.frame(colors=c("black", hue_pal()(9)), Tissue=unique(sort(TissueFactorCol$Tissue)))
TissueFactorCol.vector <- TissueFactorCol %>%
  left_join(TissueColors.df, by="Tissue") %>% pull(colors) %>% as.character()

        
  
heatmap.2(Dispersion.Cor.Matrix, trace="none", ColSideColors=SpeciesFactorCol, RowSideColors = TissueFactorCol.vector)

Let’s contrast this with the correlation matrix for cell type expression estimates:

Expression.Matrix <- Dispersion %>%
  dplyr::select(gene, Chimp_Bulk=Chimp.Mean.Expression, Human_Bulk=Human.Mean.Expression) %>%
  gather(key="Species.CellType", value="expression", -gene) %>%
  bind_rows(
    unite(CellTypeDispersion, Species.CellType, Species, CellType) %>%
      dplyr::select(gene, Species.CellType, expression=mu)) %>%
  dplyr::select(gene, Species.CellType, expression) %>%
  filter(gene %in% one2one_HumanMouseOrthologs$Gene.stable.ID) %>%
  pivot_wider(names_from=Species.CellType, values_from = expression) %>%
  column_to_rownames("gene") %>%
  as.matrix()

Expression.Matrix[mapply(is.infinite, Expression.Matrix)] <- NA
Expression.Cor.Matrix <- cor(Expression.Matrix, use="pairwise.complete.obs")

heatmap.2(Expression.Cor.Matrix, trace="none", ColSideColors=SpeciesFactorCol, RowSideColors = TissueFactorCol.vector)


sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.1.0    knitr_1.26      cowplot_1.0.0   gridExtra_2.3  
 [5] edgeR_3.26.8    limma_3.40.6    MASS_7.3-51.4   gplots_3.0.1.1 
 [9] Seurat_3.1.1    biomaRt_2.40.5  forcats_0.4.0   stringr_1.4.0  
[13] dplyr_0.8.3     purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
[17] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0 TED_1.0        

loaded via a namespace (and not attached):
  [1] reticulate_1.13             R.utils_2.9.0              
  [3] tidyselect_0.2.5            RSQLite_2.1.4              
  [5] AnnotationDbi_1.46.1        htmlwidgets_1.5.1          
  [7] grid_3.6.1                  BiocParallel_1.18.1        
  [9] Rtsne_0.15                  munsell_0.5.0              
 [11] mutoss_0.1-12               codetools_0.2-16           
 [13] ica_1.0-2                   future_1.15.1              
 [15] withr_2.1.2                 colorspace_1.4-1           
 [17] Biobase_2.44.0              highr_0.8                  
 [19] rstudioapi_0.10             stats4_3.6.1               
 [21] ROCR_1.0-7                  gbRd_0.4-11                
 [23] listenv_0.8.0               labeling_0.3               
 [25] Rdpack_0.11-0               git2r_0.26.1               
 [27] GenomeInfoDbData_1.2.1      mnormt_1.5-5               
 [29] MCMCpack_1.4-9              farver_2.0.1               
 [31] bit64_0.9-7                 rprojroot_1.3-2            
 [33] TH.data_1.0-10              coda_0.19-3                
 [35] vctrs_0.2.0                 generics_0.0.2             
 [37] xfun_0.11                   R6_2.4.1                   
 [39] GenomeInfoDb_1.20.0         rsvd_1.0.2                 
 [41] locfit_1.5-9.1              bitops_1.0-6               
 [43] DelayedArray_0.10.0         assertthat_0.2.1           
 [45] promises_1.1.0              SDMTools_1.1-221.2         
 [47] multcomp_1.4-11             nnet_7.3-12                
 [49] gtable_0.3.0                npsurv_0.4-0               
 [51] globals_0.12.5              conquer_1.0.2              
 [53] mcmc_0.9-7                  sandwich_2.5-1             
 [55] workflowr_1.5.0             rlang_0.4.1                
 [57] MatrixModels_0.4-1          zeallot_0.1.0              
 [59] genefilter_1.66.0           splines_3.6.1              
 [61] lazyeval_0.2.2              broom_0.5.2                
 [63] checkmate_2.0.0             reshape2_1.4.3             
 [65] yaml_2.2.0                  modelr_0.1.5               
 [67] backports_1.1.5             httpuv_1.5.2               
 [69] Hmisc_4.4-1                 tools_3.6.1                
 [71] ellipsis_0.3.0              RColorBrewer_1.1-2         
 [73] BiocGenerics_0.30.0         ggridges_0.5.1             
 [75] TFisher_0.2.0               Rcpp_1.0.5                 
 [77] plyr_1.8.5                  base64enc_0.1-3            
 [79] progress_1.2.2              zlibbioc_1.30.0            
 [81] RCurl_1.95-4.12             prettyunits_1.0.2          
 [83] rpart_4.1-15                pbapply_1.4-2              
 [85] S4Vectors_0.22.1            zoo_1.8-6                  
 [87] SummarizedExperiment_1.14.1 haven_2.2.0                
 [89] ggrepel_0.8.1               cluster_2.1.0              
 [91] fs_1.3.1                    magrittr_1.5               
 [93] data.table_1.12.8           SparseM_1.78               
 [95] lmtest_0.9-37               reprex_0.3.0               
 [97] RANN_2.6.1                  mvtnorm_1.0-11             
 [99] fitdistrplus_1.0-14         matrixStats_0.56.0         
[101] hms_0.5.2                   lsei_1.2-0                 
[103] evaluate_0.14               xtable_1.8-4               
[105] XML_3.98-1.20               jpeg_0.1-8.1               
[107] readxl_1.3.1                IRanges_2.18.3             
[109] compiler_3.6.1              KernSmooth_2.23-16         
[111] crayon_1.3.4                R.oo_1.23.0                
[113] htmltools_0.4.0             later_1.0.0                
[115] Formula_1.2-3               geneplotter_1.62.0         
[117] RcppParallel_4.4.4          lubridate_1.7.4            
[119] DBI_1.0.0                   dbplyr_1.4.2               
[121] Matrix_1.2-18               cli_2.0.0                  
[123] R.methodsS3_1.7.1           gdata_2.18.0               
[125] parallel_3.6.1              metap_1.2                  
[127] igraph_1.2.4.2              GenomicRanges_1.36.1       
[129] pkgconfig_2.0.3             sn_1.5-4                   
[131] numDeriv_2016.8-1.1         foreign_0.8-72             
[133] plotly_4.9.1                xml2_1.2.2                 
[135] annotate_1.62.0             multtest_2.40.0            
[137] XVector_0.24.0              bibtex_0.4.2               
[139] rvest_0.3.5                 digest_0.6.23              
[141] tsne_0.1-3                  sctransform_0.2.0          
[143] RcppAnnoy_0.0.14            rmarkdown_1.18             
[145] cellranger_1.1.0            leiden_0.3.1               
[147] htmlTable_2.0.1             uwot_0.1.5                 
[149] curl_4.3                    gtools_3.8.1               
[151] quantreg_5.61               lifecycle_0.1.0            
[153] nlme_3.1-143                jsonlite_1.6               
[155] viridisLite_0.3.0           fansi_0.4.0                
[157] pillar_1.4.2                lattice_0.20-38            
[159] plotrix_3.7-7               httr_1.4.1                 
[161] survival_3.1-8              glue_1.3.1                 
[163] png_0.1-7                   bit_1.1-14                 
[165] stringi_1.4.3               blob_1.2.0                 
[167] DESeq2_1.24.0               latticeExtra_0.6-29        
[169] caTools_1.17.1.3            memoise_1.1.0              
[171] irlba_2.3.3                 future.apply_1.3.0         
[173] ape_5.3