Single-cell PBMC Example

Last updated: 2022-03-29

Checks: 7 0

Knit directory: logistic-susie-gsea/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20220105)

The command set.seed(20220105) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7ee315c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7ee315c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    library/
    Ignored:    renv/library/
    Ignored:    renv/staging/
    Ignored:    staging/

Untracked files:
    Untracked:  _targets.R
    Untracked:  _targets.html
    Untracked:  _targets.md
    Untracked:  _targets/
    Untracked:  _targets_r/
    Untracked:  analysis/fetal_reference_cellid_gsea.Rmd
    Untracked:  analysis/fixed_intercept.Rmd
    Untracked:  analysis/iDEA_examples.Rmd
    Untracked:  analysis/latent_gene_list.Rmd
    Untracked:  analysis/latent_logistic_susie.Rmd
    Untracked:  analysis/libra_setup.Rmd
    Untracked:  analysis/linear_method_failure_modes.Rmd
    Untracked:  analysis/linear_regression_failure_regime.Rmd
    Untracked:  analysis/logistic_susie_veb_boost_vs_vb.Rmd
    Untracked:  analysis/references.bib
    Untracked:  analysis/simulations.Rmd
    Untracked:  analysis/test.Rmd
    Untracked:  analysis/wenhe_baboon_example.Rmd
    Untracked:  build_site.R
    Untracked:  cache/
    Untracked:  code/latent_logistic_susie.R
    Untracked:  code/marginal_sumstat_gsea_collapsed.R
    Untracked:  data/adipose_2yr_topsnp.txt
    Untracked:  data/fetal_reference_cellid_gene_sets.RData
    Untracked:  data/pbmc-purified/
    Untracked:  docs.zip
    Untracked:  index.md
    Untracked:  latent_logistic_susie_cache/
    Untracked:  simulation_targets/
    Untracked:  single_cell_pbmc_cache/
    Untracked:  summary_stat_gsea_exploration_cache/

Unstaged changes:
    Modified:   _simulation_targets.R
    Modified:   _targets.Rmd
    Modified:   analysis/gseabenchmark_tcga.Rmd
    Modified:   code/fit_baselines.R
    Modified:   code/fit_logistic_susie.R
    Modified:   code/fit_mr_ash.R
    Modified:   code/fit_susie.R
    Modified:   code/load_gene_sets.R
    Modified:   code/marginal_sumstat_gsea.R
    Modified:   code/simulate_gene_lists.R
    Modified:   code/utils.R
    Modified:   target_components/factories.R
    Modified:   target_components/methods.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/single_cell_pbmc.Rmd) and HTML (docs/single_cell_pbmc.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	7ee315c	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	8314424	karltayeb	2022-03-29	Build site.
Rmd	524d94b	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	61540e8	karltayeb	2022-03-29	Build site.
Rmd	54209af	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	67bf8cf	karltayeb	2022-03-29	Build site.
Rmd	170e480	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	90c6006	karltayeb	2022-03-29	Build site.
Rmd	8159b83	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	9d7fd29	karltayeb	2022-03-29	Build site.
Rmd	51647d4	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	3da7c5b	karltayeb	2022-03-29	Build site.
Rmd	fc1f3c1	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	8646723	karltayeb	2022-03-29	Build site.
Rmd	6143c44	karltayeb	2022-03-29	wflow_publish(“analysis/single_cell_pbmc.Rmd”)
html	56f8130	karltayeb	2022-03-29	Build site.
html	a2bdb56	karltayeb	2022-03-29	Build site.
Rmd	122deec	karltayeb	2022-03-29	wflow_publish(pages)

Introduction

Our goals here are to run Logistic SuSiE on differential expression results from TCGA. We want to assess:

If the resulting enrichment results look good/interpretable across multiple/concatenated gene sets
Assess sensitivity to a range of p-value thresholds
Evaluate the potential of the summary stat latent model

library(GSEABenchmarkeR)
library(EnrichmentBrowser)
library(tidyverse)
library(susieR)
library(DT)
source('code/load_gene_sets.R')
source('code/utils.R')
source('code/logistic_susie_vb.R')
source('code/logistic_susie_veb_boost.R')
source('code/latent_logistic_susie.R')

Setup

Load Gene Sets

loadGeneSetX uniformly formats gene sets and generates the \(X\) matrix We can source any gene set from WebGestaltR::listGeneSet()

gs_list <- WebGestaltR::listGeneSet()
gobp <- loadGeneSetX('geneontology_Biological_Process', min.size=50)  # just huge number of gene sets
gobp_nr <- loadGeneSetX('geneontology_Biological_Process_noRedundant', min.size=1)
gomf <- loadGeneSetX('geneontology_Molecular_Function', min.size=1)
kegg <- loadGeneSetX('pathway_KEGG', min.size=1)
reactome <- loadGeneSetX('pathway_Reactome', min.size=1)
wikipathway_cancer <- loadGeneSetX('pathway_Wikipathway_cancer', min.size=1)
wikipathway <- loadGeneSetX('pathway_Wikipathway', min.size=1)

genesets <- list(
  gobp=gobp,
  gobp_nr=gobp_nr,
  gomf=gomf,
  kegg=kegg,
  reactome=reactome,
  wikipathway_cancer=wikipathway_cancer,
  wikipathway=wikipathway
)

load('data/pbmc-purified/deseq2-pbmc-purified.RData')

convert_labels <- function(y, from='SYMBOL', to='ENTREZID'){
  hs <- org.Hs.eg.db::org.Hs.eg.db
  gene_symbols <- names(y)
  symbol2entrez <- AnnotationDbi::select(hs, keys=gene_symbols, columns=c(to, from), keytype = from)
  symbol2entrez <- symbol2entrez[!duplicated(symbol2entrez[[from]]),]
  symbol2entrez <- symbol2entrez[!is.na(symbol2entrez[[to]]),]
  symbol2entrez <- symbol2entrez[!is.na(symbol2entrez[[from]]),]
  rownames(symbol2entrez) <- symbol2entrez[[from]]
  ysub <- y[names(y) %in% symbol2entrez[[from]]]
  names(ysub) <- symbol2entrez[names(ysub),][[to]]
  return(ysub)
}


par(mfrow=c(1,1))
deseq$`CD19+ B` %>% .$padj %>% hist(main='CD19+B p-values')

Loading required package: DESeq2

Version	Author	Date
a2bdb56	karltayeb	2022-03-29

Fit logistic SuSiE

logistic_susie_driver = function(db, celltype, thresh){
  gs <- genesets[[db]]
  data <- deseq[[celltype]]
  
  # set up binary y
  y <- data %>%
    as.data.frame %>%
    rownames_to_column('gene') %>%
    dplyr::select(gene, padj) %>%
    filter(!is.na(padj)) %>%
    mutate(y = as.integer(padj < thresh)) %>%
    select(gene, y) %>%
    tibble2namedlist %>%
    convert_labels('ENSEMBL')
  
  u <- process_input(gs$X, y)  # subset to common genes
  vb.fit <- logistic.susie(  # fit model
    u$X, u$y, L=10, init.intercept = 0, verbose=1, maxit=100)

  # summarise results
  set.summary <- vb.fit$pip %>% 
    as_tibble(rownames='geneSet') %>%
    rename(pip=value) %>%
    mutate(
      top_component = apply(vb.fit$alpha, 2, which.max),
      active_set = top_component %in% vb.fit$sets$cs_index,
      top_component = paste0('L', top_component),
      cs = purrr::map(top_component, ~tryCatch(
        colnames(gs$X)[get(.x, vb.fit$sets$cs)], error = function(e) list())),
      in_cs = geneSet %in% cs,
      beta = colSums(vb.fit$mu * vb.fit$alpha),
      geneListSize = sum(u$y),
      geneSetSize = colSums(u$X),
      overlap = (u$y %*% u$X)[1,],
      nGenes = length(u$y),
      propSetInList = overlap / geneSetSize,
      oddsRatio = (overlap / (geneListSize - overlap)) / (
        (geneSetSize - overlap) / (nGenes - geneSetSize + overlap)),
    pValueHypergeometric = phyper(
      overlap-1, geneListSize, nGenes, geneSetSize, lower.tail= FALSE),
    db = db,
    celltype = celltype,
    thresh = thresh
    ) %>% left_join(gs$geneSet$geneSetDes)
  return(list(fit = vb.fit, set.summary=set.summary))
}

For each celltype, we fit logistic SuSiE using multiple gene set sources at various threshold of padj.

celltypes <- names(deseq)
pthresh <- c(0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001)
db_name <- names(genesets)
crossed <- cross3(db_name, celltypes, pthresh)

pbmc_res <- xfun::cache_rds({
  res <- purrr::map(crossed, purrr::lift_dl(logistic_susie_driver))
  for (i in 1:length(res)){  # save some space
    res[[i]]$fit$dat <- NULL
  }
  res
  }, file = 'logistic_susie_pbmc_genesets_pthresh.rds'
)

pbmc_res_set_summary <- dplyr::bind_rows(purrr::map(pbmc_res, ~ pluck(.x, 'set.summary')))

Summary functions

Just a few functions to help streamline looking at output

pval_focussed_table = function(thresh=1e-3, filter_db=NULL, filter_celltype=NULL, top.n=50){
  pbmc_res_set_summary %>%
  filter(
    case_when(
      is.null(filter_db) ~ TRUE,
      !is.null(filter_db) ~ db %in% filter_db
    ) &
    thresh == thresh &
    case_when(
      is.null(filter_celltype) ~ TRUE,
      !is.null(filter_celltype) ~ celltype %in% filter_celltype
    )
  )  %>%
  dplyr::arrange(celltype, db, pValueHypergeometric) %>%
  group_by(celltype, db) %>% slice(1:top.n) %>%
  select(celltype, db, geneSet, description, pip, top_component, oddsRatio, propSetInList, pValueHypergeometric) %>%
  mutate_at(vars(celltype, db), factor) %>%
  datatable(filter = 'top')
}

set_focussed_table = function(thresh=1e-3, filter_db=NULL, filter_celltype=NULL){
  pbmc_res_set_summary %>%
  filter(
    case_when(
      is.null(filter_db) ~ TRUE,
      !is.null(filter_db) ~ db %in% filter_db
    ) &
    thresh == 1e-3 &
    in_cs & active_set &
    case_when(
      is.null(filter_celltype) ~ TRUE,
      !is.null(filter_celltype) ~ celltype %in% filter_celltype
    )
  )  %>%
  dplyr::arrange(celltype, db, desc(pip)) %>%
  select(celltype, db, geneSet, description, pip, top_component, oddsRatio, propSetInList, pValueHypergeometric) %>%
  mutate_at(vars(celltype, geneSet, db), factor) %>%
  datatable(filter = 'top')
}

#' takes a tibble
#' organize by database and component
#' report credible set, descriptions, pips, and hypergeometric pvalue
#' in one row, with cs ordered by pip
db_component_kable = function(tbl){
  tbl %>%
  filter(active_set, thresh==1e-4) %>%
  group_by(db, celltype, top_component) %>%
  arrange(db, celltype, top_component, desc(pip)) %>%
  select(geneSet, description, pip, pValueHypergeometric) %>%
  chop(c(geneSet, description, pip, pValueHypergeometric)) %>%
  knitr::kable()
}

Results/Explore enrichments

Our goal is to assess 1. The quality of the gene set enrichments we get from each celltype - do reported gene set enrichments seem celltype specific/celltype relevant? - how much “interesting” marginal enrichment do we fail to capture in the multivariate model - how sensitive are we to the choice of pvalue threshold

Results

Lets take a look at what enrichment we’re getting across cell-types.

CD19+ B

pbmc_res_set_summary %>%
  filter(celltype == 'CD19+ B') %>%
  filter(active_set, in_cs, thresh==1e-4) %>%
  db_component_kable

Adding missing grouping variables: `db`, `celltype`, `top_component`

db	celltype	top_component	geneSet	description	pip	pValueHypergeometric
gobp	CD19+ B	L1	GO:0002376	immune system process	0.9999575	2.526664e-269
gobp	CD19+ B	L2	GO:0045047	protein targeting to ER	0.9620521	2.532661e-36
gobp	CD19+ B	L5	GO:0001775	cell activation	0.9780542	4.428372e-165
gobp_nr	CD19+ B	L2	GO:0070972	protein localization to endoplasmic reticulum	0.999958	7.382525e-33
gobp_nr	CD19+ B	L4	GO:0009123	nucleoside monophosphate metabolic process	0.9844976	7.371703e-40
gobp_nr	CD19+ B	L5	GO:0002764	immune response-regulating signaling pathway	0.9961853	8.086259e-55
gomf	CD19+ B	L1	GO:0003723	RNA binding	0.999744	3.707404e-127
gomf	CD19+ B	L2	GO:0000981	DNA-binding transcription factor activity, RNA polymerase II-specific	0.9950038	1.942472e-16
gomf	CD19+ B	L6	GO:0003735	structural constituent of ribosome	0.9953441	2.994289e-32
kegg	CD19+ B	L1	hsa00190	Oxidative phosphorylation	0.9979902	5.027625e-29
kegg	CD19+ B	L2	hsa04640	Hematopoietic cell lineage	0.9997951	8.551153e-21
kegg	CD19+ B	L3	hsa03010	Ribosome	1	3.24636e-30
reactome	CD19+ B	L1	R-HSA-168256	Immune System	1	7.929629e-180
reactome	CD19+ B	L5	R-HSA-983168	Antigen processing: Ubiquitination & Proteasome degradation	0.9970497	6.660057e-09
wikipathway	CD19+ B	L1	WP477	Cytoplasmic Ribosomal Proteins	1	1.254966e-28
wikipathway	CD19+ B	L2	WP111	Electron Transport Chain (OXPHOS system in mitochondria)	0.9999189	4.726328e-27
wikipathway_cancer	CD19+ B	L1	WP619	Type II interferon signaling (IFNG)	0.9998636	2.260695e-11

CD56+ NK

pbmc_res_set_summary %>%
  filter(celltype == 'CD56+ NK') %>%
  filter(active_set, in_cs, thresh==1e-4) %>%
  db_component_kable

Adding missing grouping variables: `db`, `celltype`, `top_component`

db	celltype	top_component	geneSet	description	pip	pValueHypergeometric
gobp	CD56+ NK	L1	GO:0002376	immune system process	0.9999984	2.947781e-276
gobp	CD56+ NK	L4	GO:0006119	oxidative phosphorylation	0.9815754	2.922476e-26
gobp_nr	CD56+ NK	L2	GO:0006413	translational initiation	0.9954877	9.230939e-37
gobp_nr	CD56+ NK	L3	GO:0042110	T cell activation	0.9856601	3.058385e-58
gobp_nr	CD56+ NK	L4	GO:0009123	nucleoside monophosphate metabolic process	0.9815927	5.279727e-40
gobp_nr	CD56+ NK	L5	GO:0042113	B cell activation	0.9972725	9.375193e-38
gomf	CD56+ NK	L1	GO:0003735	structural constituent of ribosome	1	1.182382e-44
gomf	CD56+ NK	L2	GO:0000981	DNA-binding transcription factor activity, RNA polymerase II-specific	0.9967706	3.240357e-18
kegg	CD56+ NK	L1	hsa03010	Ribosome	1	9.217956e-36
kegg	CD56+ NK	L2	hsa05012	Parkinson disease	0.9999978	2.585983e-32
kegg	CD56+ NK	L3	hsa04640	Hematopoietic cell lineage	0.9978541	4.769128e-19
reactome	CD56+ NK	L1	R-HSA-168256	Immune System	1	9.313458e-193
reactome	CD56+ NK	L4	R-HSA-163200	Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins.	0.9933245	1.782486e-25
reactome	CD56+ NK	L5	R-HSA-8878171	Transcriptional regulation by RUNX1	0.9712818	4.649106e-25
wikipathway	CD56+ NK	L1	WP477	Cytoplasmic Ribosomal Proteins	1	3.731258e-31
wikipathway	CD56+ NK	L2	WP111	Electron Transport Chain (OXPHOS system in mitochondria)	0.990862	3.36456e-24

T cell

pbmc_res_set_summary %>%
  filter(celltype == 'T cell') %>%
  filter(active_set, in_cs, thresh==1e-4) %>%
  db_component_kable

Adding missing grouping variables: `db`, `celltype`, `top_component`

db	celltype	top_component	geneSet	description	pip	pValueHypergeometric
gobp	T cell	L1	GO:0001775	cell activation	0.9982435	1.098801e-248
gobp	T cell	L2	GO:0006119	oxidative phosphorylation	0.9988673	1.89124e-33
gobp	T cell	L5	GO:0002376	immune system process	0.9984689	0
gobp_nr	T cell	L2	GO:0042110	T cell activation	0.9999906	2.423836e-82
gobp_nr	T cell	L3	GO:0070972	protein localization to endoplasmic reticulum	0.9998559	7.082395e-33
gomf	T cell	L1	GO:0005515	protein binding	1	0
kegg	T cell	L1	hsa05010	Alzheimer disease	0.9999993	1.469698e-43
reactome	T cell	L1	R-HSA-6798695	Neutrophil degranulation	1	4.546436e-106

CD14+ Monocyte

pbmc_res_set_summary %>%
  filter(celltype == 'CD14+ Monocyte') %>%
  filter(active_set, in_cs, thresh==1e-4) %>%
  db_component_kable

Adding missing grouping variables: `db`, `celltype`, `top_component`

db	celltype	top_component	geneSet	description	pip	pValueHypergeometric
gobp	CD14+ Monocyte	L3	GO:0006119	oxidative phosphorylation	0.9990485	1.775985e-31
gobp	CD14+ Monocyte	L4	GO:0016192	vesicle-mediated transport	0.9851434	2.927115e-145
gobp_nr	CD14+ Monocyte	L1	GO:0036230	granulocyte activation	0.975928	2.402996e-86
gobp_nr	CD14+ Monocyte	L2	GO:0006413	translational initiation	0.999994	1.131814e-41
gobp_nr	CD14+ Monocyte	L3	GO:0009123	nucleoside monophosphate metabolic process	0.9883528	5.790725e-44
gomf	CD14+ Monocyte	L1	GO:0003723	RNA binding	0.9999998	1.173386e-141
gomf	CD14+ Monocyte	L2	GO:0000981	DNA-binding transcription factor activity, RNA polymerase II-specific	0.9998275	1.396985e-15
gomf	CD14+ Monocyte	L4	GO:0003735	structural constituent of ribosome	1	4.918061e-44
kegg	CD14+ Monocyte	L1	hsa03010	Ribosome	1	7.453734e-35
kegg	CD14+ Monocyte	L2	hsa05012	Parkinson disease	0.9999021	2.609968e-33
reactome	CD14+ Monocyte	L1	R-HSA-6798695	Neutrophil degranulation	0.9999954	3.105498e-81
reactome	CD14+ Monocyte	L2	R-HSA-72766	Translation	1	5.172352e-59
reactome	CD14+ Monocyte	L3	R-HSA-163200	Respiratory electron transport, ATP synthesis by chemiosmotic coupling, and heat production by uncoupling proteins.	0.9996111	1.294295e-30
reactome	CD14+ Monocyte	L6	R-HSA-198933	Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell	0.9974877	5.102985e-21
reactome	CD14+ Monocyte	L7	R-HSA-379726	Mitochondrial tRNA aminoacylation	0.9972986	0.6931782
wikipathway	CD14+ Monocyte	L1	WP477	Cytoplasmic Ribosomal Proteins	1	1.207268e-29
wikipathway	CD14+ Monocyte	L2	WP111	Electron Transport Chain (OXPHOS system in mitochondria)	0.9998312	9.058329e-28

CD34+

pbmc_res_set_summary %>%
  filter(celltype == 'CD14+') %>%
  filter(active_set, in_cs, thresh==1e-4) %>%
  db_component_kable

Adding missing grouping variables: `db`, `celltype`, `top_component`

db	celltype	top_component	geneSet	description	pip	pValueHypergeometric

knitr::knit_exit()