ResponseToReviewer

Last updated: 2020-09-09

Checks: 6 1

Knit directory: Comparative_eQTL/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.5.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190319)

The command set.seed(20190319) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 199654f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    WorkingManuscript.zip
    Ignored:    WorkingManuscript/
    Ignored:    analysis/.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    analysis_temp/.DS_Store
    Ignored:    big_data/
    Ignored:    code/.DS_Store
    Ignored:    code/snakemake_workflow/.DS_Store
    Ignored:    code/snakemake_workflow/.Rhistory
    Ignored:    data/.DS_Store
    Ignored:    data/PastAnalysesDataToKeep/.DS_Store
    Ignored:    figures/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  analysis/20200907_Response_OriginalComments.Rmd
    Untracked:  analysis/20200907_Response_Point_11.Rmd

Unstaged changes:
    Modified:   analysis_temp/Ted_vignette.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Original reviewer point:

Have the authors tried estimating dispersion on top of what is expected based on differences in cell type? There are several strategies that might work for this: There are new strategies for estimating a posterior of cell type specific expression from a bulk sample, conditional on scRNA-seq data as prior information (Chu and Danko, bioRxiv, 2020). These cell type specific expression estimates could then be analyzed for dispersion. Alternatively, it may also work to regress the estimated proportion of each cell type out of the dispersion estimates. While there are certainly a lot of pitfalls with using these strategies, especially in the setting shown here (all of this would work better if there were species matched reference data), they might provide an avenue for depleting the contribution of cell type differences from dispersion estimates.

Here I will use the Chu and Danko methodology (“TED”, or “BayesPrism”, see https://github.com/Danko-Lab/TED) to estimate cell type specific expression estimates for each individual, and then estimate a cell type specific dispersion estimate. As the reviewer notes, there a lot of pitfalls with this strategy. Really, in the absence of scRNA-seq data (and a human scRNA-seq reference), there is only so much we can gain I think. But still worth a try.

library(TED)
library(tidyverse)
library(biomaRt)
library(Seurat)
library(gplots)
source("../code/CustomFunctions.R")
library(knitr)
library(scales)

This block references a file that I keep on my local computer, that is not reproducible from the snakemake. This rds file references a seurat object I created from publicaly available mouse scRNA-seq data (Tabula Muris) using this Rscript.

Heart.seur = readRDS("../big_data/TabMuris_heart_seurat.rds")

Some other data to read in… This includes the bulk RNA-seq count data from this study (39 human, 39 chimp), and some other necessary data to convert mouse gene names to human, etc.

#read in table of overdispersion, dispersion, mean expresion estimates
Dispersion <- read.delim('../output/OverdispersionEstimatesFromChimp.txt')

Dispersion$MeanDispersion <- (Dispersion$Chimp.Residual + Dispersion$Human.Residual)/2

CountTableChimpFile <- '../output/PowerAnalysisFullCountTable.Chimp.subread.txt.gz'
CountTableHumanFile <- '../output/PowerAnalysisFullCountTable.Human.subread.txt.gz'
OutputDE <- '../output/Final/TableS2.tab'
DropFileName <- '../data/DE_SamplesToDrop.txt'

DropFile <- read.delim(DropFileName, sep='\t', col.names = c("Sample", "Species"), stringsAsFactors = F)
HumanSamplesToDrop <- DropFile %>% filter(Species=="Human") %>% pull(Sample)
ChimpSamplesToDrop <- DropFile %>% filter(Species=="Chimp") %>% pull(Sample)
DE.results <- read.delim(OutputDE, sep='\t', stringsAsFactors = F)
GeneListForOverdispersionCalculation <- DE.results$Ensembl_geneID

CountTables <- GetCountTables(CountTableChimpFile,
                              CountTableHumanFile,
                              0, GeneListForOverdispersionCalculation, ChimpSampleDrop=ChimpSamplesToDrop, HumanSampleDrop = HumanSamplesToDrop)

human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")

MouseGenes = rownames(Heart.seur)
genes = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol", values = MouseGenes ,mart = mouse, attributesL = c("ensembl_gene_id", "chromosome_name", "start_position", "mmusculus_homolog_orthology_type"), martL = human, uniqueRows=T)

# Get list of one to one orthologs in both tabula muris and our dataset for further analysis
one2one_HumanMouseOrthologs <- genes %>%
  filter(Mouse.homology.type=="ortholog_one2one") %>%
  filter(Gene.stable.ID %in% Dispersion$gene) %>%
  dplyr::select(MGI.symbol, Gene.stable.ID) %>%
  distinct(MGI.symbol, .keep_all = T)

Now prepare input data from run.Ted function:

CellTypes <- data.frame(Cell.ID=colnames(Heart.seur@assays$RNA), CellType=Heart.seur@meta.data$Cell.ontology.class)

scCountTable.Filtered <- Heart.seur@assays$RNA[one2one_HumanMouseOrthologs$MGI.symbol,] %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  inner_join(one2one_HumanMouseOrthologs, by=c("rowname"="MGI.symbol")) %>%
  dplyr::select(-rowname) %>%
  column_to_rownames("Gene.stable.ID") %>% t() %>%
  as.data.frame() %>%
  rownames_to_column("Cell.ID") %>%
  inner_join(CellTypes, by="Cell.ID") %>%
  filter(!CellType == "unassigned cell type") %>%
  dplyr::select(-CellType) %>%
  column_to_rownames("Cell.ID") %>% t()
  
scCountTable.Filtered[1:10,1:10] %>% kable()

	A1.B000412.3_56_F.1.1	A1.B000633.3_56_F.1.1	A1.B002423.3_39_F.1.1	A1.B002427.3_39_F.1.1	A1.B002428.3_38_F.1.1	A1.MAA000398.3_9_M.1.1	A1.MAA000400.3_8_M.1.1	A1.MAA000898.3_11_M.1.1	A1.MAA000899.3_10_M.1.1	A1.MAA000903.3_11_M.1.1
ENSG00000265491	295	0	516	0	0	0	59	0	135	0
ENSG00000265972	676	555	0	1	82	414	0	1068	299	467
ENSG00000271601	0	0	0	0	0	0	0	0	0	0
ENSG00000272031	0	0	0	0	0	0	0	0	0	0
ENSG00000134250	0	48	181	1	36	12	269	583	12	828
ENSG00000131788	0	0	0	0	0	0	0	13	14	0
ENSG00000186141	0	0	0	0	0	0	0	0	0	305
ENSG00000186364	0	0	0	0	0	0	0	0	0	0
ENSG00000143127	0	0	0	0	0	0	0	0	0	0
ENSG00000198483	0	0	0	0	0	0	0	137	0	0

CellTypeVector <- CellTypes %>%
  filter(!CellType == "unassigned cell type") %>% pull(CellType) %>% as.character()
head(CellTypeVector)

[1] "endothelial cell"   "leukocyte"          "fibroblast"        
[4] "myofibroblast cell" "endocardial cell"   "fibroblast"

Bulk.CountTable <- cbind(CountTables$Chimp$Counts, CountTables$Human$Counts) %>%
  rownames_to_column() %>%
  inner_join(one2one_HumanMouseOrthologs, by=c("rowname"="Gene.stable.ID")) %>%
  dplyr::select(-MGI.symbol) %>%
  column_to_rownames()
Bulk.CountTable[1:10,1:10] %>% kable()

	C.4x0519	C.4x373	C.476	C.4x0025	C.438	C.462	C.4X0354	C.503	C.317	C.623
ENSG00000186891	46	73	20	28	4	11	98	13	11	3
ENSG00000186827	274	393	313	397	173	128	975	236	307	84
ENSG00000078808	5967	1931	7637	3892	2957	3745	1189	5857	4306	3704
ENSG00000176022	1097	402	1574	1333	955	1235	416	1087	745	640
ENSG00000160087	761	283	1153	776	471	610	275	920	547	604
ENSG00000131584	2587	1537	3646	2546	1755	1323	4585	2250	1484	1872
ENSG00000169972	282	78	396	257	209	282	111	243	141	252
ENSG00000169962	320	111	366	80	178	145	269	416	237	107
ENSG00000107404	7019	3600	11211	7397	5731	4670	3942	9779	5566	6768
ENSG00000162576	3701	1059	1384	2114	1426	2418	1242	1360	2221	1948

Now, use run.Ted function to estimate cell type porportions and also per individual expression within each cell type. In this code block, which I have run before knitting the Rmarkdown, I will also save the results to an RDS file that I will read in the next code block, so I don’t have to re-run this time consuming function when I want to knit the Rmarkdown.

Ted.results <- run.Ted(t(scCountTable.Filtered), t(Bulk.CountTable), pheno.labels = CellTypeVector, input.type = "scRNA", n.cores = 1)
saveRDS(Ted.results, file = "../big_data/TedResults.rds")

Now read in the ted results.

Ted.results <- readRDS("../big_data/TedResults.rds")
Ted.Expression.Per.CellType <- Ted.results$res$first.gibbs.res$Znkg #mean reads in each cell type
Ted.CellTypePorportions <- Ted.results$res$first.gibbs.res$theta.merged #cell fraction

Ok, now let’s compare the cell type porportions to CIBERSORT results…

#Ted cell type porportions table
Ted.CellTypePorportions %>% head() %>% kable()

	endothelial cell	leukocyte	fibroblast	myofibroblast cell	endocardial cell	cardiac muscle cell	professional antigen presenting cell	erythrocyte	smooth muscle cell
C.4x0519	0.0655604	0.0015653	0.0279680	0.0863701	0.0335372	0.5969150	0.0281230	0.1059831	0.0539779
C.4x373	0.0729209	0.0000807	0.0353951	0.0581507	0.0102430	0.5956729	0.0252973	0.0969409	0.1052985
C.476	0.0443883	0.0000164	0.0007314	0.0427403	0.0154651	0.7444661	0.0230774	0.0774483	0.0516668
C.4x0025	0.0840676	0.0002683	0.0140482	0.1144108	0.0322334	0.5566394	0.0434807	0.1087926	0.0460590
C.438	0.0438903	0.0000171	0.0061195	0.0420949	0.0162850	0.7467382	0.0235200	0.0658731	0.0554619
C.462	0.0661200	0.0005401	0.0194387	0.0645938	0.0215091	0.6691649	0.0268254	0.0845318	0.0472763

Ted.CellTypePorportions.df <- Ted.CellTypePorportions %>% as.data.frame() %>%
  rownames_to_column("Input.Sample") %>%
  select_all(~gsub("\\s+", ".", .)) %>%
  arrange(cardiac.muscle.cell) %>%
  gather(key="cell.type", value="percent", -Input.Sample) %>%
  mutate(Species = case_when(str_detect(Input.Sample, "C.") ~ "Chimp",
                   str_detect(Input.Sample, "H.") ~ "Human"))

#Ted cell type porportions as plot
ggplot(Ted.CellTypePorportions.df, aes(x=reorder(Input.Sample, percent, FUN=max), y=percent, fill=cell.type)) +
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(-0.001,1.001), expand = c(0, 0)) +
  facet_grid(~Species, scales="free_x", space="free_x") +
  theme_bw() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

#CIBERSORT cell type porportions
CIBERSORT <- read.table("../data/CIBERSORT.Output_Job2.csv", sep=',', header=T) %>%
  arrange(cardiac.muscle.cell) %>%
  gather(key="cell.type", value="percent", -Input.Sample, -P.value, -Pearson.Correlation, -RMSE) %>%
  mutate(Species = case_when(str_detect(Input.Sample, "C.") ~ "Chimp",
                   str_detect(Input.Sample, "H.") ~ "Human"))

#As a plot...
ggplot(CIBERSORT, aes(x=reorder(Input.Sample, percent, FUN=max), y=percent, fill=cell.type)) +
  geom_bar(stat="identity") +
  scale_y_continuous(limits = c(-0.001,1.001), expand = c(0, 0)) +
  facet_grid(~Species, scales="free_x", space="free_x") +
  theme_bw() +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

#Compare CIBERSORT and TED cell porportions
inner_join(CIBERSORT, Ted.CellTypePorportions.df, by=c("Input.Sample", "cell.type"), suffix=c(".CIBERSORT", ".TED")) %>%
  ggplot(aes(x=percent.TED, y=percent.CIBERSORT, color=cell.type)) +
  geom_point(aes(shape=Species.TED)) +
  theme_bw()

inner_join(CIBERSORT, Ted.CellTypePorportions.df, by=c("Input.Sample", "cell.type"), suffix=c(".CIBERSORT", ".TED")) %>%
  ggplot(aes(x=percent.TED, y=percent.CIBERSORT, color=cell.type)) +
  geom_point(aes(shape=Species.TED)) +
  facet_wrap(~cell.type, scales = "free") +
  theme_bw()

Ok, some reasonable agreement between the cell type porportions for some cell types, but not for others (particularly the less prevalent cell types).

Ok, now let’s explore TED’s output for expression per cell type per individual, and consider estimating dispersion in a cell type specific fashion in a way similar to the bulk dispersion estimates.

Ted.Expression.Per.CellType %>% dim()

[1]    79     9 10001

dimnames(Ted.Expression.Per.CellType)[[1]] <- rownames(Ted.results$res$first.gibbs.res$theta.merged)
dimnames(Ted.Expression.Per.CellType)[[2]] <- colnames(Ted.results$res$first.gibbs.res$theta.merged)
dimnames(Ted.Expression.Per.CellType)[[3]] <- colnames(Ted.results$res$first.gibbs.res$Zkg)

Ted.Expression.Per.CellType.df <- Ted.Expression.Per.CellType %>%
  as.data.frame() %>% t() %>% as.data.frame() %>%
  rownames_to_column() %>%
  separate(rowname, into=c("CellType", "gene"), sep="\\.")

Ted.Expression.Per.CellType.df %>% head() %>% kable()

CellType	gene	C.4x0519	C.4x373	C.476	C.4x0025	C.438	C.462	C.4X0354	C.503	C.317	C.623	C.554	C.4X0095	C.549	C.4X0267	C.4X0339	C.529	C.495	C.338	C.4X0550	C.558	C.4x523	C.389	C.4x0043	C.4X0357	C.88A020	C.554_2	C.Little_R	C.676	C.456	C.537	C.MD_And	C.724	C.4x0430	C.95A014	C.4X0333	C.522	C.295	C.4X0212	C.570	H.SRR1477033	H.SRR598148	H.SRR601645	H.SRR599086	H.SRR604230	H.SRR1478189	H.SRR613186	H.59263	H.62905	H.63060	H.SRR600852	H.62606	H.SRR604174	H.SRR608096	H.SRR1489693	H.62765	H.61317	H.59167	H.SRR602106	H.SRR601868	H.SRR612335	H.SRR598589	H.SRR600474	H.SRR603449	H.SRR606939	H.SRR1481012	H.SRR599249	H.59365	H.SRR607313	H.63145	H.SRR1477569	H.SRR601239	H.SRR607252	H.SRR612875	H.SRR601613	H.SRR614683	H.SRR604122	H.59511	H.62850	H.SRR599380
endothelial cell	ENSG00000271601	33.62	14.13	42.24	26.24	13.91	20.09	9.18	20.97	21.35	18.85	19.36	24.77	20.95	26.77	5.61	11.77	27.47	34.49	21.10	12.77	17.93	21.59	16.31	10.93	13.42	20.70	15.22	20.33	13.27	13.56	15.65	37.65	23.02	19.14	31.34	10.73	45.42	22.09	24.04	11.76	19.25	19.48	19.62	47.66	9.97	3.02	31.37	12.16	13.86	10.79	6.76	26.20	12.44	25.66	11.75	7.45	9.67	13.84	12.64	22.52	30.10	13.48	9.83	31.99	13.10	35.29	21.13	30.34	17.91	13.35	18.78	11.34	16.69	16.05	14.90	46.83	10.28	20.34	34.97
leukocyte	ENSG00000271601	0.21	0.00	0.00	0.03	0.00	0.07	0.00	0.00	0.00	0.17	0.00	0.04	0.04	0.02	0.00	0.11	0.24	1.42	0.00	0.02	0.60	0.01	0.01	0.00	0.11	0.00	0.00	0.09	0.04	0.02	0.01	0.48	1.04	0.01	0.00	0.00	0.01	0.13	0.07	0.07	0.54	0.17	0.12	1.00	0.03	0.00	1.90	0.14	0.16	0.08	0.09	0.19	0.09	0.01	0.20	0.01	0.05	0.04	0.03	1.03	0.66	0.02	0.01	0.25	0.02	0.09	0.17	0.37	0.05	0.00	0.19	0.02	0.11	0.07	0.07	0.78	0.00	0.27	0.24
fibroblast	ENSG00000271601	50.30	23.94	2.19	16.15	7.13	20.59	0.45	1.35	30.27	15.82	0.76	6.19	78.02	24.31	1.84	24.80	29.23	21.01	12.99	16.75	55.92	17.19	21.82	2.27	3.12	9.69	27.92	6.36	7.51	43.36	5.70	6.14	7.56	6.27	31.85	13.02	23.51	8.98	12.03	4.70	21.01	6.99	10.55	13.38	1.20	6.57	65.89	35.78	16.15	2.48	17.04	35.84	53.90	16.39	4.36	20.49	21.04	4.73	4.74	48.90	13.27	9.56	2.13	4.22	7.88	13.63	41.32	14.15	13.29	6.61	0.58	12.23	23.51	19.30	2.60	14.30	15.60	11.97	2.09
myofibroblast cell	ENSG00000271601	116.55	30.33	108.13	97.52	36.27	53.13	13.36	50.35	63.34	49.61	53.00	59.96	58.25	64.29	9.13	27.65	52.42	78.03	72.87	24.29	73.06	63.32	43.24	21.50	44.20	55.98	15.05	64.04	45.34	26.86	31.00	76.45	61.39	53.24	91.24	35.34	92.34	47.52	71.77	42.96	51.82	56.50	45.18	165.63	29.82	6.36	146.65	32.01	44.51	36.22	27.54	142.43	67.65	46.80	73.26	21.68	21.12	38.41	41.20	127.33	112.89	43.43	33.31	92.20	62.67	89.46	101.15	82.89	50.59	38.47	55.65	73.29	55.28	65.74	45.63	137.46	26.61	76.28	93.03
endocardial cell	ENSG00000271601	31.16	3.92	26.70	19.68	9.65	11.95	5.28	12.36	14.82	16.35	13.50	17.46	25.39	12.91	2.49	3.31	15.32	26.75	19.81	6.29	19.80	17.07	17.65	6.72	15.93	10.08	4.63	18.06	16.22	8.29	18.26	26.58	18.12	16.24	18.83	10.58	23.71	18.37	16.25	10.09	12.83	18.87	9.55	39.83	6.53	1.55	34.51	6.55	10.47	6.08	4.97	21.64	6.53	13.75	12.44	5.57	7.05	7.91	9.45	25.48	22.62	9.17	5.91	24.50	12.93	12.54	27.18	23.70	14.48	8.45	15.78	11.76	11.34	14.87	8.90	33.58	7.22	17.13	22.41
cardiac muscle cell	ENSG00000271601	87.27	32.39	202.95	51.15	67.67	58.45	6.95	115.05	55.18	58.21	53.82	68.82	58.48	89.59	21.19	10.95	13.14	71.15	88.51	9.18	55.66	111.10	53.40	36.71	70.14	59.25	1.78	43.86	77.88	3.11	51.23	91.64	61.93	52.74	66.59	63.03	157.06	72.20	39.53	53.53	23.89	57.67	29.88	92.89	36.59	1.37	47.74	42.10	40.69	35.90	28.13	66.33	22.43	13.40	59.27	10.76	4.75	26.57	53.95	40.19	83.37	40.37	31.79	96.59	59.24	87.82	58.41	83.78	43.75	41.68	57.98	43.50	37.62	48.77	39.06	79.05	20.32	79.57	84.64

Are these expression values? Are they on a log scale? I should read some documentation, but let’s also just look at a histogram. I expect gene expression across different genes in an individual to appear roughly bell shaped on a log-scale.

hist(Ted.Expression.Per.CellType.df$C.4x0519)

hist(log10(Ted.Expression.Per.CellType.df$C.4x0519))

Yeah, so I think these are un-log transformed expression values. Are they normalized for sequencing depth per individual per cell type? Let’s check the median gene expression level for each individual x cell type combination… If the expression values are already normalized, I expect the median gene expression level to be roughly similar (at least, more similar than the >4 fold range in sequencing depth across samples)

Ted.Expression.Per.CellType.df %>%
  dplyr:: select(CellType, gene, starts_with("C.")) %>%
  gather("Ind", "Expression", -CellType, -gene) %>%
  group_by(Ind, CellType) %>%
  summarise(Med=median(Expression, na.rm=T)) %>%
  ggplot(aes(y=Med, x=Ind, fill=CellType)) +
  geom_col() +
  facet_wrap(~CellType) +
  theme_bw() +
  theme(
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Ok, things may not be normalized across cell types, but if I want to estimate dispersion on a per-cell type basis that may not matter… Let’s see if the median counts are related to total sequencing depth…

#Total gene mapping read counts from bulk count table
TotalBulkCount.df <- data.frame(TotalBulkCount = cbind(CountTables$Chimp$Counts, CountTables$Human$Counts) %>% colSums()) %>%
  rownames_to_column("Ind")
Ted.Expression.Per.CellType.df %>%
  dplyr:: select(CellType, gene, starts_with("C.")) %>%
  gather("Ind", "Expression", -CellType, -gene) %>%
  group_by(Ind, CellType) %>%
  summarise(Med=median(Expression, na.rm=T)) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  ggplot(aes(x=Med, y=TotalBulkCount, color=CellType)) +
  geom_point() +
  facet_wrap(~CellType,  scales = "free") +
  theme_bw()

Ok, since total counts (columnSums) from the bulk matrix strongly correlates with the median gene expression measurements output by TED for most cell types, I conclude that this TED output is estimates of the original count table, (though, obviously not integer counts). So perhaps the most reasonable way to estimate dispersion per cell type from this TED output is to first transform the TED expression matrices to log(CPM), and, similarly to the dispersion estimation procedure in the original manuscript, fit a loess curve to the trend between the population mean expression and log(variance).

Though I have an intuition that $log(variance(logCPM))$ is analogous to the $log(\phi)$ that I used to estimate dispersion, I can’t derive mathematically why $log(variance(logCPM))$ ) is the appropriate metric (as opposed to $sd(logCPM)$ for example). But, I can show empirically show that $log(variance(logCPM))$ is basically equivalent to the $log(\phi)$ parameter I estimated from fitting a negative binomial to the bulk raw count data:

#Compare the overdispersion (phi) estimate to the log(variance(logCPM)).

CountTables$Chimp$Counts %>%
  rownames_to_column("gene") %>%
  gather("Ind", "Expression", -gene) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  mutate(Log.CPM.Expression=log(Expression/TotalBulkCount*1E6)) %>%
  group_by(gene) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var.logcpm=log(var(Log.CPM.Expression))) %>%
  left_join(Dispersion, by="gene") %>%
  # head() %>%
  ggplot(aes(x=log.var.logcpm, y=log(Chimp.Overdispersion))) +
    geom_point()

#in contrast to the sd(logCPM)...
CountTables$Chimp$Counts %>%
  rownames_to_column("gene") %>%
  gather("Ind", "Expression", -gene) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  mutate(Log.CPM.Expression=log(Expression/TotalBulkCount*1E6)) %>%
  group_by(gene) %>%
  summarise(mu=mean(Log.CPM.Expression), sd.logcpm=sd(Log.CPM.Expression)) %>%
  left_join(Dispersion, by="gene") %>%
  # head() %>%
  ggplot(aes(x=sd.logcpm, y=log(Chimp.Overdispersion))) +
    geom_point()

Ok, now let’s look at the loess plots and think about estimating dispersion on a per cell type (and per species) basis.

#loess log(var) vs mean for chimp by cell type
Normalized.Expression.Per.CellType <- Ted.Expression.Per.CellType.df %>%
  gather("Ind", "Expression", -CellType, -gene) %>%
  left_join(TotalBulkCount.df, by="Ind") %>%
  mutate(Log.CPM.Expression=log(Expression/TotalBulkCount*1E6))

#loess fit, just for chimp population
Normalized.Expression.Per.CellType %>%
  filter(startsWith(Ind, "C")) %>%
  group_by(gene, CellType) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var=log(var(Log.CPM.Expression))) %>%
  ggplot(aes(x=mu, y=log.var, color=CellType)) +
  geom_point(alpha=0.05) +
  geom_smooth(method="loess", method.args=list(degree=1)) +
  theme_bw()

#...for both chimp and human
Normalized.Expression.Per.CellType %>%
  mutate(Species = case_when(str_detect(Ind, "C.") ~ "Chimp",
                   str_detect(Ind, "H.") ~ "Human")) %>%
  group_by(gene, CellType, Species) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var=log(var(Log.CPM.Expression))) %>%
  ggplot(aes(x=mu, y=log.var, color=CellType)) +
  geom_point(alpha=0.03) +
  geom_smooth(method="loess", method.args=list(degree=1), aes(linetype=Species), color="black") +
  facet_wrap(~CellType) +
  theme_bw()

It seems most trends are similar between species for different cell types, except leukocytes which I may drop from further analysis. The fact thiat this leukocyte cell type is an outlier may be biologically meaningful, especially given that ~4 of the chimp samples have been challenged with virus. But it might also be technical, I would rather just avoid that cell type.

So now I will calculate the residual from each loess fit as the cell type specific dispersion. Similar to the bulk case, I may use bootstrapping to estimate standard error and perform inference to compare the human vs chimp dispersion estimates.

Let’s start first just by getting dispersion estimates, and worry about bootstrapping and inference later.

DataToEstimateCellTypeDispersion <- Normalized.Expression.Per.CellType %>%
  mutate(Species = case_when(str_detect(Ind, "C.") ~ "Chimp",
                   str_detect(Ind, "H.") ~ "Human")) %>%
  group_by(gene, CellType, Species) %>%
  summarise(mu=mean(Log.CPM.Expression), log.var=log(var(Log.CPM.Expression)))

#Estimate dispersion (residual from the loess fit)
CellTypeDispersion <- DataToEstimateCellTypeDispersion %>%
  group_by(Species, CellType) %>% 
    do(data.frame(., resid = residuals(loess(log.var ~ mu, data=., degree=1, na.action="na.exclude"))))

#Plot kernel density estimate of dispersion... One quick way to verify that I correctly obtained the residual of per-species and per-cell-type loess fits, is to look at the leukoctye in chimp and humans. If I did this correctly, I expect the dispersion estimates for every species x cell type combination to be roughly centered around 0, even the leukocytes.
CellTypeDispersion %>%
  ggplot(aes(x=resid, color=Species)) +
  geom_density() +
  facet_wrap(~CellType) +
  theme_bw()

Ok, it seems I have dispersion estimates for each cell type. How to present this in an interpretable way. How about I start by plotting a correlation matrix with hiearchal clustering dendrogram, comparing the dispersion estimates from bulk to those for each cell type, for both species.

Dispersion.Cor.Matrix <- Dispersion %>%
  dplyr::select(gene, Chimp_Bulk=Chimp.Dispersion, Human_Bulk=Human.Dispersion) %>%
  gather(key="Species.CellType", value="dispersion", -gene) %>%
  bind_rows(
    unite(CellTypeDispersion, Species.CellType, Species, CellType) %>%
      dplyr::select(gene, Species.CellType, dispersion=resid)) %>%
  dplyr::select(gene, Species.CellType, dispersion) %>%
  filter(gene %in% one2one_HumanMouseOrthologs$Gene.stable.ID) %>%
  pivot_wider(names_from=Species.CellType, values_from = dispersion) %>%
  column_to_rownames("gene") %>%
  as.matrix() %>%
  cor(use="pairwise.complete.obs")

SpeciesFactorCol <- colnames(Dispersion.Cor.Matrix) %>% substr(1,1) %>% factor() %>% unclass() %>% as.character() %>% recode("1"="#F8766D", "2"="#00BFC4")

TissueFactorCol <- data.frame(ColNames=colnames(Dispersion.Cor.Matrix)) %>%
  mutate(Tissue=str_remove(ColNames, "^.+?_"))
TissueColors.df <- data.frame(colors=c("black", hue_pal()(9)), Tissue=unique(sort(TissueFactorCol$Tissue)))
TissueFactorCol.vector <- TissueFactorCol %>%
  left_join(TissueColors.df, by="Tissue") %>% pull(colors) %>% as.character()

        
  
heatmap.2(Dispersion.Cor.Matrix, trace="none", ColSideColors=SpeciesFactorCol, RowSideColors = TissueFactorCol.vector)

Let’s contrast this with the correlation matrix for cell type expression estimates:

Expression.Matrix <- Dispersion %>%
  dplyr::select(gene, Chimp_Bulk=Chimp.Mean.Expression, Human_Bulk=Human.Mean.Expression) %>%
  gather(key="Species.CellType", value="expression", -gene) %>%
  bind_rows(
    unite(CellTypeDispersion, Species.CellType, Species, CellType) %>%
      dplyr::select(gene, Species.CellType, expression=mu)) %>%
  dplyr::select(gene, Species.CellType, expression) %>%
  filter(gene %in% one2one_HumanMouseOrthologs$Gene.stable.ID) %>%
  pivot_wider(names_from=Species.CellType, values_from = expression) %>%
  column_to_rownames("gene") %>%
  as.matrix()

Expression.Matrix[mapply(is.infinite, Expression.Matrix)] <- NA
Expression.Cor.Matrix <- cor(Expression.Matrix, use="pairwise.complete.obs")

heatmap.2(Expression.Cor.Matrix, trace="none", ColSideColors=SpeciesFactorCol, RowSideColors = TissueFactorCol.vector)

sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.1.0    knitr_1.26      cowplot_1.0.0   gridExtra_2.3  
 [5] edgeR_3.26.8    limma_3.40.6    MASS_7.3-51.4   gplots_3.0.1.1 
 [9] Seurat_3.1.1    biomaRt_2.40.5  forcats_0.4.0   stringr_1.4.0  
[13] dplyr_0.8.3     purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
[17] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0 TED_1.0        

loaded via a namespace (and not attached):
  [1] reticulate_1.13             R.utils_2.9.0              
  [3] tidyselect_0.2.5            RSQLite_2.1.4              
  [5] AnnotationDbi_1.46.1        htmlwidgets_1.5.1          
  [7] grid_3.6.1                  BiocParallel_1.18.1        
  [9] Rtsne_0.15                  munsell_0.5.0              
 [11] mutoss_0.1-12               codetools_0.2-16           
 [13] ica_1.0-2                   future_1.15.1              
 [15] withr_2.1.2                 colorspace_1.4-1           
 [17] Biobase_2.44.0              highr_0.8                  
 [19] rstudioapi_0.10             stats4_3.6.1               
 [21] ROCR_1.0-7                  gbRd_0.4-11                
 [23] listenv_0.8.0               labeling_0.3               
 [25] Rdpack_0.11-0               git2r_0.26.1               
 [27] GenomeInfoDbData_1.2.1      mnormt_1.5-5               
 [29] MCMCpack_1.4-9              farver_2.0.1               
 [31] bit64_0.9-7                 rprojroot_1.3-2            
 [33] TH.data_1.0-10              coda_0.19-3                
 [35] vctrs_0.2.0                 generics_0.0.2             
 [37] xfun_0.11                   R6_2.4.1                   
 [39] GenomeInfoDb_1.20.0         rsvd_1.0.2                 
 [41] locfit_1.5-9.1              bitops_1.0-6               
 [43] DelayedArray_0.10.0         assertthat_0.2.1           
 [45] promises_1.1.0              SDMTools_1.1-221.2         
 [47] multcomp_1.4-11             nnet_7.3-12                
 [49] gtable_0.3.0                npsurv_0.4-0               
 [51] globals_0.12.5              conquer_1.0.2              
 [53] mcmc_0.9-7                  sandwich_2.5-1             
 [55] workflowr_1.5.0             rlang_0.4.1                
 [57] MatrixModels_0.4-1          zeallot_0.1.0              
 [59] genefilter_1.66.0           splines_3.6.1              
 [61] lazyeval_0.2.2              broom_0.5.2                
 [63] checkmate_2.0.0             reshape2_1.4.3             
 [65] yaml_2.2.0                  modelr_0.1.5               
 [67] backports_1.1.5             httpuv_1.5.2               
 [69] Hmisc_4.4-1                 tools_3.6.1                
 [71] ellipsis_0.3.0              RColorBrewer_1.1-2         
 [73] BiocGenerics_0.30.0         ggridges_0.5.1             
 [75] TFisher_0.2.0               Rcpp_1.0.5                 
 [77] plyr_1.8.5                  base64enc_0.1-3            
 [79] progress_1.2.2              zlibbioc_1.30.0            
 [81] RCurl_1.95-4.12             prettyunits_1.0.2          
 [83] rpart_4.1-15                pbapply_1.4-2              
 [85] S4Vectors_0.22.1            zoo_1.8-6                  
 [87] SummarizedExperiment_1.14.1 haven_2.2.0                
 [89] ggrepel_0.8.1               cluster_2.1.0              
 [91] fs_1.3.1                    magrittr_1.5               
 [93] data.table_1.12.8           SparseM_1.78               
 [95] lmtest_0.9-37               reprex_0.3.0               
 [97] RANN_2.6.1                  mvtnorm_1.0-11             
 [99] fitdistrplus_1.0-14         matrixStats_0.56.0         
[101] hms_0.5.2                   lsei_1.2-0                 
[103] evaluate_0.14               xtable_1.8-4               
[105] XML_3.98-1.20               jpeg_0.1-8.1               
[107] readxl_1.3.1                IRanges_2.18.3             
[109] compiler_3.6.1              KernSmooth_2.23-16         
[111] crayon_1.3.4                R.oo_1.23.0                
[113] htmltools_0.4.0             later_1.0.0                
[115] Formula_1.2-3               geneplotter_1.62.0         
[117] RcppParallel_4.4.4          lubridate_1.7.4            
[119] DBI_1.0.0                   dbplyr_1.4.2               
[121] Matrix_1.2-18               cli_2.0.0                  
[123] R.methodsS3_1.7.1           gdata_2.18.0               
[125] parallel_3.6.1              metap_1.2                  
[127] igraph_1.2.4.2              GenomicRanges_1.36.1       
[129] pkgconfig_2.0.3             sn_1.5-4                   
[131] numDeriv_2016.8-1.1         foreign_0.8-72             
[133] plotly_4.9.1                xml2_1.2.2                 
[135] annotate_1.62.0             multtest_2.40.0            
[137] XVector_0.24.0              bibtex_0.4.2               
[139] rvest_0.3.5                 digest_0.6.23              
[141] tsne_0.1-3                  sctransform_0.2.0          
[143] RcppAnnoy_0.0.14            rmarkdown_1.18             
[145] cellranger_1.1.0            leiden_0.3.1               
[147] htmlTable_2.0.1             uwot_0.1.5                 
[149] curl_4.3                    gtools_3.8.1               
[151] quantreg_5.61               lifecycle_0.1.0            
[153] nlme_3.1-143                jsonlite_1.6               
[155] viridisLite_0.3.0           fansi_0.4.0                
[157] pillar_1.4.2                lattice_0.20-38            
[159] plotrix_3.7-7               httr_1.4.1                 
[161] survival_3.1-8              glue_1.3.1                 
[163] png_0.1-7                   bit_1.1-14                 
[165] stringi_1.4.3               blob_1.2.0                 
[167] DESeq2_1.24.0               latticeExtra_0.6-29        
[169] caTools_1.17.1.3            memoise_1.1.0              
[171] irlba_2.3.3                 future.apply_1.3.0         
[173] ape_5.3

ResponseToReviewer_Point11