Last updated: 2021-02-03
Checks: 7 0
Knit directory: scATACseq-topics/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it's best to always run the code in an empty environment.
The command set.seed(20200729)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 4b74e1e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: output/plotly/Buenrostro_2018_Chen2019pipeline/
Untracked files:
Untracked: analysis/gene_analysis_Buenrostro2018_Chen2019pipeline.Rmd
Untracked: analysis/process_data_Buenrostro2018_Chen2019.Rmd
Untracked: output/clustering-Cusanovich2018.rds
Untracked: scripts/fit_all_models_Buenrostro_2018_chromVar_scPeaks_filtered.sbatch
Untracked: scripts/fit_models_Cusanovich2018_tissues.sh
Unstaged changes:
Modified: analysis/assess_fits_Cusanovich2018.Rmd
Modified: analysis/gene_analysis_Cusanovich2018.Rmd
Modified: analysis/index.Rmd
Modified: analysis/motif_analysis_Buenrostro2018_Chen2019pipeline.Rmd
Modified: analysis/plots_Cusanovich2018.Rmd
Modified: analysis/process_data_Cusanovich2018.Rmd
Modified: scripts/fit_poisson_nmf.sbatch
Modified: scripts/prefit_poisson_nmf.sbatch
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/clusters_Cusanovich2018_k13.Rmd
) and HTML (docs/clusters_Cusanovich2018_k13.html
) files. If you've configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 4b74e1e | kevinlkx | 2021-02-03 | add cell label x cluster table and heatmap |
html | 4b2facc | kevinlkx | 2021-02-03 | Build site. |
Rmd | 98f5f51 | kevinlkx | 2021-02-03 | minor change on the knit settings |
html | 3770d7e | kevinlkx | 2021-02-03 | Build site. |
Rmd | 91cab4b | kevinlkx | 2021-02-03 | added k-means results on PCA of topic proportions, and merged two clusters, and moved PCA plots to the end |
html | f8ec9d7 | kevinlkx | 2020-12-02 | Build site. |
Rmd | b30b5c6 | kevinlkx | 2020-12-02 | print outlier cells in c15 |
html | 24c6d8b | kevinlkx | 2020-12-02 | Build site. |
Rmd | 586af90 | kevinlkx | 2020-12-02 | wflow_publish("analysis/clusters_Cusanovich2018_k13.Rmd") |
html | 019f436 | kevinlkx | 2020-12-02 | Build site. |
Rmd | 06efc2b | kevinlkx | 2020-12-02 | zoom into a few clusters |
html | 96392bf | kevinlkx | 2020-12-02 | Build site. |
Rmd | 8db261b | kevinlkx | 2020-12-02 | wflow_publish("analysis/clusters_Cusanovich2018_k13.Rmd") |
html | deead06 | kevinlkx | 2020-12-01 | Build site. |
Rmd | 2af43a8 | kevinlkx | 2020-12-01 | update tissue colors |
html | 06b9bc9 | kevinlkx | 2020-12-01 | Build site. |
Rmd | bc7e770 | kevinlkx | 2020-12-01 | update tissue colors |
html | 6c97fa9 | kevinlkx | 2020-12-01 | Build site. |
Rmd | ab321de | kevinlkx | 2020-12-01 | refine kmeans clusters and add cell label distributions |
html | e57dfd4 | kevinlkx | 2020-10-21 | Build site. |
Rmd | a88649d | kevinlkx | 2020-10-21 | wflow_publish("analysis/clusters_Cusanovich2018_k13.Rmd") |
Rmd | 137c252 | kevinlkx | 2020-10-21 | set toc = yes |
html | 4c0ab95 | kevinlkx | 2020-10-21 | Build site. |
html | 9d31a9f | kevinlkx | 2020-10-21 | Build site. |
Rmd | c00f12b | kevinlkx | 2020-10-21 | update colors and interpret clusters with tissue labels |
html | 32ed5ae | kevinlkx | 2020-10-21 | Build site. |
Rmd | 82292d5 | kevinlkx | 2020-10-21 | update colors and interpret clusters with tissue labels |
html | b8a48b9 | kevinlkx | 2020-10-20 | Build site. |
Rmd | a4daf6d | kevinlkx | 2020-10-20 | clustering with k = 13 topics |
html | ac8ca65 | kevinlkx | 2020-10-20 | Build site. |
Rmd | 98920ed | kevinlkx | 2020-10-20 | clustering with k = 13 topics |
html | a38788b | kevinlkx | 2020-10-20 | Build site. |
Rmd | f8bea96 | kevinlkx | 2020-10-20 | clustering with k = 13 topics |
Here we explore the structure in the Cusanovich et al (2018) ATAC-seq data inferred from the multinomial topic model with \(k = 13\).
Load packages and some functions used in this analysis.
library(Matrix)
library(dplyr)
library(ggplot2)
library(cowplot)
library(plyr)
library(dplyr)
library(RColorBrewer)
library(fastTopics)
library(DT)
library(reshape)
source("code/plots.R")
Load the data. The counts are no longer needed at this stage of the analysis.
data.dir <- "/project2/mstephens/kevinluo/scATACseq-topics/data/Cusanovich_2018/processed_data/"
load(file.path(data.dir, "Cusanovich_2018.RData"))
rm(counts)
Load the model fit
fit.dir <- "/project2/mstephens/kevinluo/scATACseq-topics/output/Cusanovich_2018"
fit <- readRDS(file.path(fit.dir, "/fit-Cusanovich2018-scd-ex-k=13.rds"))$fit
fit_multinom <- poisson2multinom(fit)
About the samples: The study measured single cell chromatin accessibility for 17 samples spanning 13 different tissues in 8-week old mice.
cat(nrow(samples), "samples (cells). \n")
samples$tissue <- as.factor(samples$tissue)
cat(length(levels(samples$tissue)), "tissues. \n")
# table(samples$tissue)
colors_tissues <- c("darkblue", # BoneMarrow
"gray30", # Cerebellum
"red", # Heart
"springgreen", # Kidney
"brown", # LargeIntestine
"purple", # Liver
"deepskyblue", # Lung
"black", # PreFrontalCortex
"darkgreen", # SmallIntestine
"plum", # Spleen
"gold", # Testes
"darkorange", # Thymus
"gray") # WholeBrain
samples$cell_label <- as.factor(samples$cell_label)
cat(length(levels(samples$cell_label)), "cell types \n")
# 81173 samples (cells).
# 13 tissues.
# 40 cell types
The structure plots below summarize the topic proportions in the samples grouped by different tissues.
set.seed(10)
colors_topics <- c("#a6cee3","#1f78b4","#b2df8a","#33a02c","#fb9a99","#e31a1c",
"#fdbf6f","#ff7f00","#cab2d6","#6a3d9a","#ffff99","#b15928",
"gray")
samples$tissue <- as.factor(samples$tissue)
rows <- sample(nrow(fit$L),4000)
p.structure.1 <- structure_plot(select(fit_multinom,loadings = rows),
grouping = samples[rows, "tissue"],n = Inf,gap = 40,
perplexity = 50,topics = 1:13,colors = colors_topics,
num_threads = 4,verbose = FALSE)
print(p.structure.1)
Version | Author | Date |
---|---|---|
3770d7e | kevinlkx | 2021-02-03 |
Define clusters using k-means, and then create structure plot based on the clusters from k-means.
k-means clustering (using 15 clusters) on topic proportions
set.seed(10)
clusters <- factor(kmeans(fit_multinom$L,centers = 15,iter.max = 100)$cluster)
summary(clusters)
rows <- sample(nrow(fit$L),4000)
p.structure.kmeans <- structure_plot(select(fit_multinom,loadings = rows),
grouping = clusters[rows],n = Inf,gap = 20,
perplexity = 50,topics = 1:13,colors = colors_topics,
num_threads = 4,verbose = FALSE)
print(p.structure.kmeans)
# 1 2 3 4 5 6 7 8 9 10 11 12 13
# 2740 3868 3775 3879 3645 3651 12037 2955 6045 12026 6068 3582 9319
# 14 15
# 3411 4172
First do PCA on topic proportions and then do k-means clustering (using 15 clusters). Results are the same as the results from running k-means directly on the topic proportions.
set.seed(10)
pca <- prcomp(fit_multinom$L)$x
clusters2 <- factor(kmeans(pca,centers = 15,iter.max = 100)$cluster)
summary(clusters2)
rows <- sample(nrow(fit$L),4000)
p.structure.kmeans <- structure_plot(select(fit_multinom,loadings = rows),
grouping = clusters2[rows],n = Inf,gap = 20,
perplexity = 50,topics = 1:13,colors = colors_topics,
num_threads = 4,verbose = FALSE)
print(p.structure.kmeans)
Version | Author | Date |
---|---|---|
3770d7e | kevinlkx | 2021-02-03 |
length(which(clusters != clusters2))
# 1 2 3 4 5 6 7 8 9 10 11 12 13
# 2740 3868 3775 3879 3645 3651 12037 2955 6045 12026 6068 3582 9319
# 14 15
# 3411 4172
# [1] 0
Refine k-means clusters:
Structure plot based on refined clusters
set.seed(10)
clusters.refined <- as.numeric(clusters)
clusters.refined[which(clusters == 9)] <- 4
# clusters.refined[which(clusters == 10 & fit_multinom$L[,8] >= 0.2)] <- 16
# clusters.refined[which(clusters == 10 & fit_multinom$L[,8] < 0.2)] <- 17
clusters.refined[which(clusters == 10)] <- kmeans(fit_multinom$L[which(clusters == 10), ],centers = 2)$cluster+15
clusters.refined <- factor(clusters.refined, labels = paste0("c", c(1:length(unique(clusters.refined)))))
samples$cluster_kmeans <- clusters.refined
rows <- sample(nrow(fit$L),4000)
p.structure.kmeans.refined <- structure_plot(select(fit_multinom,loadings = rows),
grouping = clusters.refined[rows],n = Inf,gap = 40,
perplexity = 50,topics = 1:13,colors = colors_topics,
num_threads = 4,verbose = FALSE)
print(p.structure.kmeans.refined)
Save the clustering results to an RDS file
out.dir <- "/project2/mstephens/kevinluo/scATACseq-topics/output/Cusanovich_2018"
saveRDS(samples, paste0(out.dir, "/samples-clustering-Cusanovich2018.rds"))
cat("Result saved to:", paste0(out.dir, "/samples-clustering-Cusanovich2018.rds"), "\n")
samples <- readRDS(paste0(out.dir, "/samples-clustering-Cusanovich2018.rds"))
print(table(samples$cluster_kmeans))
# Result saved to: /project2/mstephens/kevinluo/scATACseq-topics/output/Cusanovich_2018/samples-clustering-Cusanovich2018.rds
#
# c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13
# 2740 3868 3775 9924 3645 3651 12037 2955 6068 3582 9319 3411 4172
# c14 c15
# 6423 5603
Tissue composition in each cluster:
freq_cluster_tissue <- count(samples, vars=c("cluster_kmeans","tissue"))
# stacked barplot for the counts of tissues by clusters
p.barplot <- ggplot(freq_cluster_tissue, aes(fill=tissue, y=freq, x=cluster_kmeans)) +
geom_bar(position="fill", stat="identity") +
theme_classic() + xlab("Cluster") + ylab("Proportion of cells") +
scale_fill_manual(values = colors_tissues) +
guides(fill=guide_legend(ncol=2)) +
theme(
legend.title = element_text(size = 10),
legend.text = element_text(size = 8)
)
print(p.barplot)
We can see a few clusters are tissue specific:
Some clusters are combinations of related tissues:
Some clusters are more heterogeneous mixtures of different tissues.
Distribution of tissue labels by cluster.
freq_table_cluster_tissue <- with(samples,table(tissue,cluster_kmeans))
freq_table_cluster_tissue <- as.data.frame.matrix(freq_table_cluster_tissue)
DT::datatable(freq_table_cluster_tissue,
options = list(pageLength = nrow(freq_table_cluster_tissue)),
rownames = T, caption = "Number of cells")
create_celllabel_cluster_heatmap(samples$tissue, samples$cluster_kmeans, normalize_by = "column")
Distribution of cell labels by cluster.
freq_table_cluster_celllabel <- with(samples,table(cell_label,cluster_kmeans))
freq_table_cluster_celllabel <- as.data.frame.matrix(freq_table_cluster_celllabel)
DT::datatable(freq_table_cluster_celllabel,
options = list(pageLength = 15),
rownames = T, caption = "Number of cells")
create_celllabel_cluster_heatmap(samples$cell_label, samples$cluster_kmeans, normalize_by = "column")
Structure plot for c2 cluster: Group samples by tissue labels first, then by cell labels
rows.c2 <- which(samples$cluster_kmeans == "c2")
tissue_labels_cluster <- as.factor(as.character(samples[rows.c2, "tissue"]))
cell_labels_cluster <- as.factor(as.character(samples[rows.c2, "cell_label"]))
sort(table(tissue_labels_cluster), decreasing = T)
sort(table(cell_labels_cluster), decreasing = T)
p.structure <- structure_plot(select(fit_multinom,loadings = rows.c2),
grouping = tissue_labels_cluster,
rows = order(cell_labels_cluster),
n = Inf,gap = 40,
perplexity = 50,topics = 1:13,colors = colors_topics,
num_threads = 4,verbose = FALSE)
print(p.structure)
# tissue_labels_cluster
# Heart Lung Kidney Liver
# 1374 1093 757 268
# WholeBrain PreFrontalCortex LargeIntestine Cerebellum
# 116 75 73 51
# SmallIntestine Testes BoneMarrow Spleen
# 22 19 15 3
# Thymus
# 2
# cell_labels_cluster
# Endothelial II cells Endothelial I cells
# 2405 512
# Endothelial I (glomerular) Podocytes
# 386 369
# Unknown Cardiomyocytes
# 167 15
# Type II pneumocytes Collisions
# 8 2
# Loop of henle Microglia
# 1 1
# Proximal tubule T cells
# 1 1
breaks <- c(0,1,5,10,100,Inf)
fit.c2 <- select(fit_multinom,loadings = rows.c2)
p.pca.c2 <- pca_plot(fit.c2,fill = "none")
p.pca.hexbin.c2 <- pca_hexbin_plot(fit.c2,breaks = breaks) + guides(fill = "none")
plot_grid(p.pca.c2, p.pca.hexbin.c2)
The variation in PCs 1 and 2 is mostly produced by topics 3,13
p.pca.c2.topics <- pca_plot(fit.c2,k = c(3,13))
print(p.pca.c2.topics)
colors_celllabels_c2 <- c("firebrick","dodgerblue","forestgreen",
"darkmagenta","darkorange","gold","darkblue",
"peru","greenyellow","olivedrab",
"darkgray", "black")
p.pca.c2.tissue <- labeled_pca_plot(fit.c2,1:2,samples[rows.c2, "tissue"],font_size = 7,
colors = colors_tissues, legend_label = "tissue")
p.pca.c2.cell <- labeled_pca_plot(fit.c2,1:2,samples[rows.c2, "cell_label"],font_size = 7,
colors = colors_celllabels_c2, legend_label = "cell_label")
plot_grid(p.pca.c2.tissue,p.pca.c2.cell,rel_widths = c(8,9))
Structure plot of c14 cluster: Group samples by tissue labels first, then by cell labels
rows.c14 <- which(samples$cluster_kmeans == "c14")
tissue_labels_cluster <- as.factor(as.character(samples[rows.c14, "tissue"]))
cell_labels_cluster <- as.factor(as.character(samples[rows.c14, "cell_label"]))
sort(table(tissue_labels_cluster), decreasing = T)
sort(table(cell_labels_cluster), decreasing = T)
p.structure <- structure_plot(select(fit_multinom,loadings = rows.c14),
grouping = tissue_labels_cluster,
rows = order(cell_labels_cluster),
n = Inf,gap = 40,
perplexity = 50,topics = 1:13,colors = colors_topics,
num_threads = 4,verbose = FALSE)
print(p.structure)
Version | Author | Date |
---|---|---|
3770d7e | kevinlkx | 2021-02-03 |
# tissue_labels_cluster
# PreFrontalCortex WholeBrain Cerebellum Lung
# 4261 1964 196 2
# cell_labels_cluster
# Ex. neurons CPN Ex. neurons CThPN Ex. neurons SCPN Inhibitory neurons
# 1822 1449 1394 965
# SOM+ Interneurons Unknown Collisions Purkinje cells
# 466 230 86 5
# Astrocytes Oligodendrocytes
# 3 3
breaks <- c(0,1,5,10,100,Inf)
fit.c14 <- select(fit_multinom,loadings = rows.c14)
p.pca.c14 <- pca_plot(fit.c14,fill = "none")
p.pca.hexbin.c14 <- pca_hexbin_plot(fit.c14,breaks = breaks) + guides(fill = "none")
plot_grid(p.pca.c14, p.pca.hexbin.c14)
Version | Author | Date |
---|---|---|
3770d7e | kevinlkx | 2021-02-03 |
The variation in PCs 1 and 2 is mostly produced by topics 9,10,11
p.pca.c14.topics <- pca_plot(fit.c14,k = c(9,10,11))
print(p.pca.c14.topics)
Version | Author | Date |
---|---|---|
3770d7e | kevinlkx | 2021-02-03 |
colors_celllabels_c14 <- c("firebrick","dodgerblue","forestgreen",
"darkmagenta","darkorange","gold","darkblue",
"peru","greenyellow","black")
p.pca.c14.tissue <- labeled_pca_plot(fit.c14,1:2,samples[rows.c14, "tissue"],font_size = 7,
colors = colors_tissues, legend_label = "tissue")
p.pca.c14.cell <- labeled_pca_plot(fit.c14,1:2,samples[rows.c14, "cell_label"],font_size = 7,
colors = colors_celllabels_c14, legend_label = "cell_label")
plot_grid(p.pca.c14.tissue,p.pca.c14.cell,rel_widths = c(8,11))
Version | Author | Date |
---|---|---|
3770d7e | kevinlkx | 2021-02-03 |
print(samples[intersect(rows.c14, which(samples$tissue == "Lung")),])
# cell tissue tissue.replicate cluster
# 22109 TCCGCGAATAGGTAACTTACTGAGCGACGGCTCTGA Lung Lung2_62216 15
# 61277 TCTCGCGCTATTGCTGGACCTCCGACGGGTACTGAC Lung Lung2_62216 5
# subset_cluster tsne_1 tsne_2 subset_tsne1 subset_tsne2
# 22109 2 -1.07335 -12.29793 -8.494056 -4.764539
# 61277 2 -1.77437 -23.62355 4.220333 2.583797
# id cell_label cluster_kmeans
# 22109 clusters_15.cluster_2 Inhibitory neurons c14
# 61277 clusters_5.cluster_2 Ex. neurons SCPN c14
sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Scientific Linux 7.4 (Nitrogen)
#
# Matrix products: default
# BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] reshape_0.8.8 DT_0.16 fastTopics_0.4-29 RColorBrewer_1.1-2
# [5] plyr_1.8.6 cowplot_1.1.1 ggplot2_3.3.3 dplyr_1.0.3
# [9] Matrix_1.2-18 workflowr_1.6.2
#
# loaded via a namespace (and not attached):
# [1] ggrepel_0.9.1 Rcpp_1.0.6 lattice_0.20-41 tidyr_1.1.2
# [5] prettyunits_1.1.1 rprojroot_2.0.2 digest_0.6.27 R6_2.5.0
# [9] MatrixModels_0.4-1 evaluate_0.14 coda_0.19-4 httr_1.4.2
# [13] pillar_1.4.7 rlang_0.4.10 progress_1.2.2 lazyeval_0.2.2
# [17] data.table_1.13.6 irlba_2.3.3 SparseM_1.78 hexbin_1.28.1
# [21] whisker_0.4 rmarkdown_2.6 labeling_0.4.2 Rtsne_0.15
# [25] stringr_1.4.0 htmlwidgets_1.5.3 munsell_0.5.0 compiler_3.6.1
# [29] httpuv_1.5.4 xfun_0.19 pkgconfig_2.0.3 mcmc_0.9-7
# [33] htmltools_0.5.1.1 tidyselect_1.1.0 tibble_3.0.6 quadprog_1.5-8
# [37] matrixStats_0.58.0 viridisLite_0.3.0 crayon_1.4.0 conquer_1.0.2
# [41] withr_2.4.1 later_1.1.0.1 MASS_7.3-53 grid_3.6.1
# [45] jsonlite_1.7.2 gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0
# [49] git2r_0.27.1 magrittr_2.0.1 scales_1.1.1 RcppParallel_5.0.2
# [53] stringi_1.5.3 farver_2.0.3 fs_1.3.1 promises_1.1.1
# [57] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.6 tools_3.6.1
# [61] glue_1.4.2 purrr_0.3.4 crosstalk_1.1.1 hms_1.0.0
# [65] yaml_2.2.1 colorspace_2.0-0 plotly_4.9.3 knitr_1.30
# [69] quantreg_5.83 MCMCpack_1.5-0