Visualize and interpret topics in Cusanovich2018 data

Last updated: 2020-09-11

Checks: 7 0

Knit directory: scATACseq-topics/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200729)

The command set.seed(20200729) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: dc70e46

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version dc70e46. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/plots_Cusanovich2018.Rmd) and HTML (docs/plots_Cusanovich2018.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	dc70e46	kevinlkx	2020-09-11	add t-SNE and membership pie charts
html	5a8479b	kevinlkx	2020-09-10	Build site.
Rmd	b9993e2	kevinlkx	2020-09-10	initial structure plots

Here we examine and compare the topic modeling results for the scATAC-seq dataset from the mouse single-cell atlas paper Cusanovich et al (2018)

Load the packages used in the analysis below, as well as additional functions that will be used to generate some of the plots.

library(tools)
library(dplyr)
library(fastTopics)
library(ggplot2)
library(cowplot)
library(mapplots)
source("code/plots.R")

set.seed(1)

Cusanovich 2018 mouse scATAC-seq dataset

Load the data and Poisson NMF model fit

Load the data

data.dir <- "/project2/mstephens/kevinluo/scATACseq-topics/data/Cusanovich_2018/processed_data/"
load(file.path(data.dir, "Cusanovich_2018.RData"))
cat(sprintf("%d x %d counts matrix.\n",nrow(counts),ncol(counts)))
rm(counts)
# 81173 x 436206 counts matrix.

About the samples: The study measured single cell chromatin accessibility for 17 samples spanning 13 different tissues in 8-week old mice.

cat(nrow(samples), "samples (cells). \n")
# 81173 samples (cells).

Tissues:

samples$tissue <- as.factor(samples$tissue)
cat(length(levels(samples$tissue)), "tissues. \n")
table(samples$tissue)
# 13 tissues. 
# 
#       BoneMarrow       Cerebellum            Heart           Kidney 
#             8403             2278             7650             6431 
#   LargeIntestine            Liver             Lung PreFrontalCortex 
#             7086             6167             9996             5959 
#   SmallIntestine           Spleen           Testes           Thymus 
#             4077             4020             2723             7617 
#       WholeBrain 
#             8766

Cell types labels:

samples$cell_label <- as.factor(samples$cell_label)
cat(length(levels(samples$cell_label)), "cell types \n")
table(samples$cell_label)
# 40 cell types 
# 
#          Activated B cells       Alveolar macrophages 
#                        500                        559 
#                 Astrocytes                    B cells 
#                       1666                       5772 
#             Cardiomyocytes   Cerebellar granule cells 
#                       4076                       4099 
#            Collecting duct                 Collisions 
#                        164                       1218 
#                     DCT/CD            Dendritic cells 
#                        506                        958 
#   Distal convoluted tubule Endothelial I (glomerular) 
#                        319                        552 
#        Endothelial I cells       Endothelial II cells 
#                        952                       3019 
#                Enterocytes              Erythroblasts 
#                       4783                       2811 
#            Ex. neurons CPN          Ex. neurons CThPN 
#                       1832                       1540 
#           Ex. neurons SCPN  Hematopoietic progenitors 
#                       1466                       3425 
#                Hepatocytes           Immature B cells 
#                       5664                        571 
#         Inhibitory neurons              Loop of henle 
#                       1828                        815 
#                Macrophages                  Microglia 
#                        711                        422 
#                  Monocytes                   NK cells 
#                       1173                        321 
#           Oligodendrocytes                  Podocytes 
#                       1558                        498 
#            Proximal tubule         Proximal tubule S3 
#                       2570                        594 
#             Purkinje cells         Regulatory T cells 
#                        320                        507 
#          SOM+ Interneurons                      Sperm 
#                        553                       2089 
#                    T cells         Type I pneumocytes 
#                       8954                       1622 
#        Type II pneumocytes                    Unknown 
#                        157                      10029

Load the results of running fit_poisson_nmf on the Cusanovich2018 data, with different algorithms, and for various choices of \(k\) (the number of “topics”).

out.dir <- "/project2/mstephens/kevinluo/scATACseq-topics/output/Cusanovich_2018"
load(file.path(out.dir, "/compiled.fits.Cusanovich2018.RData"))

Select all scd-ex model fits

fits <- fits[grep("scd-ex", names(fits))]

Explore the structure of the single-cell data as inferred by the topic model.

Structure plot

The structure plots below summarize the topic proportions in the samples grouped by different tissues.

\(k = 2\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=2"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 3\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=3"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 4\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=4"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 5\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=5"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 6\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=6"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 7\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=7"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 8\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=8"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 9\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=9"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

\(k = 10\):

fit_poisson_nmf <- fits[["fit-Cusanovich2018-scd-ex-k=10"]]

p.structure_plot <- structure_plot(poisson2multinom(fit_poisson_nmf),
                     grouping = samples[,"tissue"],
                     n = 2000,gap = 40,num_threads = 4,verbose = FALSE)
print(p.structure_plot)

Version	Author	Date
5a8479b	kevinlkx	2020-09-10

Plot samples by tissues on the t-SNE coordinates provided by the original paper

Plot all samples by tissues:

p.tissues_tsne <- ggplot(samples, aes(x = tsne_1, y = tsne_2)) + 
  geom_point(aes(colour = tissue), size = 0.5) + 
  labs(x = 't-SNE 1', y = 't-SNE 2', title = '') + 
  scale_color_discrete('') +
  theme_classic()

print(p.tissues_tsne)

Plot a random subset of 2000 samples by tissues:

idx_plot <- sort(sample(1:length(samples$cell), 2000))

p.tissues_tsne <- ggplot(samples[idx_plot,], aes(x = tsne_1, y = tsne_2)) + 
  geom_point(aes(colour = tissue), size = 0.5) + 
  labs(x = 't-SNE 1', y = 't-SNE 2', title = '') + 
  scale_color_discrete('') +
  theme_classic()

print(p.tissues_tsne)

Visualize topic membership using pie-charts for cells on t-SNE coordinates provided by the original paper

Plot the subset of 2000 samples: