Last updated: 2023-07-28

Checks: 7 0

Knit directory: SuperCellCyto-analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(42)

The command set.seed(42) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 402358b

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 402358b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    code/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/bodenmiller_cytof/
    Ignored:    data/explore_supercell_purity_clustering/
    Ignored:    data/haas_bm/
    Ignored:    data/oetjen_bm_dataset/
    Ignored:    data/trussart_cytofruv/
    Ignored:    output/.DS_Store
    Ignored:    output/bodenmiller_cytof/
    Ignored:    output/explore_supercell_purity_clustering/
    Ignored:    output/label_transfer/
    Ignored:    output/oetjen_b_cell_panel/
    Ignored:    output/trussart_cytofruv/

Untracked files:
    Untracked:  README.html
    Untracked:  analysis/ref.bib
    Untracked:  code/README.html
    Untracked:  code/b_cell_identification/
    Untracked:  code/batch_correction/
    Untracked:  code/bodenmiller_data/
    Untracked:  code/explore_supercell_purity_clustering/
    Untracked:  code/label_transfer/
    Untracked:  data/README.html
    Untracked:  output/README.html

Unstaged changes:
    Modified:   .gitignore
    Modified:   README.md
    Modified:   _workflowr.yml
    Modified:   analysis/_site.yml
    Modified:   code/README.md
    Modified:   data/README.md
    Modified:   output/README.md

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/de_test.Rmd) and HTML (docs/de_test.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	402358b	Givanna Putri	2023-07-28	wflow_publish(c("analysis/*Rmd"))

Introduction

In this analysis, we assess whether a differential expression analysis performed at the supercell level can recapitulate previously published findings obtained by performing differential expression analysis using the Diffcyt algorithm (Weber et al. 2019) at the single cell level.

Specifically, we analysed a publicly available mass cytometry dataset quantifying the immune cells in stimulated and unstimulated human peripheral blood cells (BCR_XL dataset (Bodenmiller et al. 2012)). This is a paired experimental design, with each of the 8 independent samples, obtained from 8 different individuals, contributing to both stimulated and unstimulated samples (16 samples in total).

This dataset was previously analysed using Diffcyt to identify the cell state markers that were differentially expressed between the stimulated samples (BCR-XL group) and the unstimulated samples (Reference group).

Our aim was to replicate these findings using a combination of SuperCellCyto and the Limma R package (Ritchie et al. 2015). The analysis protocol is as the following. First the data was downloaded using the HDCytoData package and an arcsinh transformation with cofactor 5 was applied. Subsequently, SuperCellCyto was applied on the data. The relevant scripts can be found in code/bodenmiller_data/download_bodenmiller_data.R and code/bodenmiller_data/run_supercell_bodenmiller_data.R.

The findings reported here represent the results following the application of SuperCellCyto.

Load libraries and data

library(data.table)
library(ggplot2)
library(limma)
library(pheatmap)
library(gridExtra)
library(here)
library(SuperCell)

supercell_cell_map <- fread(here("output", "bodenmiller_cytof", "20230522", "supercell_gamma20_cell_map.csv"))
cell_info <- fread(here("output", "bodenmiller_cytof", "20230522", "cell_info_with_cell_id.csv"))
supercell_mat <- fread(here("output", "bodenmiller_cytof", "20230522", "supercell_gamma20_exp_mat.csv"))
markers <- fread(here("data", "bodenmiller_cytof", "markers_info.csv"))

supercell_cell_map <- merge.data.table(
  x = supercell_cell_map,
  y = cell_info[, c("cell_id", "population_id"), with = FALSE],
  by.x = "CellId",
  by.y = "cell_id"
)

Examine purity of supercells

Purity was calculated using the implementation provided by the SuperCell package.

supercell_purity <- as.data.table(
  supercell_purity(clusters = supercell_cell_map$population_id,
  supercell_membership = supercell_cell_map$SuperCellID),
  keep.rownames = TRUE
)
setnames(supercell_purity, c("V1", "V2"), c("supercell_id", "purity_score"))

ggplot(supercell_purity, aes(x = purity_score)) +
  geom_histogram(binwidth = .05) +
  labs(
    x = "Purity score",
    y = "Number of supercells",
    title = "Supercell purity for gamma = 20",
    subtitle = paste("Mean purity =", round(mean(supercell_purity$purity_score), 2))
  ) +
  theme_bw() +
  scale_y_continuous(breaks = scales::pretty_breaks(n=10)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n=10))

Differential Expression Analysis

We first annotate the supercells based on the most abundant cell type they contain. In situations where the most abundant cell type is a tie between several different cell types, the one that appears first in the sequence is used as the supercell’s annotation.

supercell_annotation <- supercell_cell_map[, .N, by = .(SuperCellID, population_id)][
  order(-N), .SD[1], by = SuperCellID
]

supercell_mat <- merge.data.table(
  x = supercell_mat,
  y = supercell_annotation[, c("SuperCellID", "population_id")],
  by.x = "SuperCellId",
  by.y = "SuperCellID"
)

Subsequently, we compute the mean expression of each state marker across different sample and cell type.

# Do DE test on just the cell state markers
markers_state <- paste0(markers[markers$marker_class == 'state']$marker_name, "_asinh_cf5")

supercell_mean_perSampPop <- supercell_mat[, lapply(.SD, mean), by = c("sample_id", "population_id"), .SDcols = markers_state]

# Add in the group details and the patient id
sample_info <- unique(cell_info, by = c("group_id", "patient_id", "sample_id"))
sample_info[, c("cell_id", "population_id") := list(NULL, NULL)]

supercell_mean_perSampPop <- merge.data.table(supercell_mean_perSampPop, sample_info, by = "sample_id")
supercell_mean_perSampPop[, `:=`(
  group_id = factor(group_id, levels=c("Reference", "BCR-XL")),
  patient_id = factor(patient_id, levels=paste0("patient", seq(8))),
  sample_id = factor(sample_id)
)]
supercell_mean_perSampPop <- supercell_mean_perSampPop[order(group_id)]

For each cell type, we then used the limma package to identify state markers that are differentially expressed across the 2 conditions. Specifically, we take into account the paired experimental design and use the treat function with fc=1.1 to compute empirical Bayes moderated-t p-values relative to the minimum fold-change threshold set using the fc parameter.

cell_types <- c("B-cells IgM+", "CD4 T-cells", "NK cells")

limma_res_supercell <- lapply(cell_types, function(ct) {

  pseudobulk_sub <- supercell_mean_perSampPop[population_id == ct]

  groups <- pseudobulk_sub$group_id
  
  # Taking into account the paired measurement
  patients <- pseudobulk_sub$patient_id
  design <- model.matrix(~0 + groups + patients)
  colnames(design) <- gsub("groups", "", colnames(design))
  colnames(design) <- gsub("-", "_", colnames(design))
  cont <- makeContrasts(
    RefvsBCRXL = Reference - BCR_XL,
    levels = colnames(design)
  )
  fit <- lmFit(t(pseudobulk_sub[, markers_state, with = FALSE]), design)
  fit_cont <- contrasts.fit(fit, cont)
  fit_cont <- eBayes(fit_cont, trend=TRUE, robust=TRUE)
  treat_all <- treat(fit_cont, fc=1.1)
  
  res <- as.data.table(
    topTreat(treat_all, number = length(markers_state)),
    keep.rownames = TRUE
  )
  # rename the rowname column as marker rather than "rn"
  setnames(res, "rn", "marker")
  res[, cell_type := ct]
  
  return(res)
})
limma_res_supercell <- rbindlist(limma_res_supercell)

Finally we use heatmap to assess the results returned by limma.

# 1 panel per population
pheatmap_sample_info <- copy(sample_info)
pheatmap_sample_info[, `:=`(
  group_id = factor(group_id, levels = c("Reference", "BCR-XL")),
  patient_id = factor(patient_id, levels = paste0("patient", seq_len(8)))
)]
setnames(pheatmap_sample_info, "group_id", "Group")
pheatmap_sample_info <- pheatmap_sample_info[order(Group, patient_id)]
pheatmap_sample_info[, patient_id := NULL]

pheatmap_sample_info <- data.frame(pheatmap_sample_info)
rownames(pheatmap_sample_info) <- pheatmap_sample_info$sample_id
pheatmap_sample_info$sample_id <- NULL

cell_types_to_plot <- c("B-cells IgM+", "CD4 T-cells", "NK cells")

heatmaps <- lapply(cell_types_to_plot, function(ct) {
  # state markers status based on limma result
  markers_states_info <- limma_res_supercell[cell_type == ct,]
  markers_states_info[, FDR_sig := factor(ifelse(adj.P.Val <= 0.05, "Yes", "No"), levels = c("Yes", "No")) ]
  markers_states_info[, marker := gsub("_asinh_cf5", "", marker)]
  markers_states_info <- markers_states_info[order(FDR_sig)]

  markers_states_info <- data.frame(markers_states_info[, c("marker", "FDR_sig")])
  rownames(markers_states_info) <- markers_states_info$marker
  markers_states_info$marker <- NULL

  # filter to keep only a population,
  # change the ordering of rows (samples) to suit sample_info so "reference"
  # group comes first,
  # set sample_id as row names
  # rename the columns to remove _asinh
  # reorganise the columns so the significant ones come first
  pheatmap_data <- supercell_mean_perSampPop[population_id == ct,]
  pheatmap_data[, sample_id := factor(sample_id, levels = rownames(pheatmap_sample_info))]
  rownames(pheatmap_data) <- pheatmap_data$sample_id
  pheatmap_data[, population_id := NULL]
  setnames(pheatmap_data, markers_state, gsub("_asinh_cf5", "", markers_state))

  pheatmap_data[, c("patient_id", "group_id") := list(NULL, NULL)]
  setcolorder(pheatmap_data, c("sample_id", rownames(markers_states_info)))
  
  pheatmap_mat <- as.matrix(pheatmap_data, rownames = "sample_id")

  pheatmap(
    mat = pheatmap_mat,
    cluster_cols = FALSE,
    annotation_row = pheatmap_sample_info,
    annotation_col = markers_states_info,
    main = ct,
    cluster_rows = FALSE,
    annotation_colors = list(
      Group = c("Reference" = "#E6E6FA", "BCR-XL" = "#5D3FD3"),
      FDR_sig = c("Yes" = "darkgreen", "No" = "grey")
    ),
    scale = "column"
  )
})

Our findings were consistent with those identified by Diffcyt, including elevated expression of pS6, pPlcg2, pErk, and pAkt in B cells in the stimulated group, along with reduced expression of pNFkB in the stimulated group.

We also recapitulated the Diffcyt results in CD4 T cells and Natural Killer (NK) cells, with significant differences in the expression of pBtk and pNFkB in CD4 T cells between the stimulated and unstimulated groups, and distinct differences in the expression of pBtk, pSlp76, and pNFkB in NK cells between the stimulated and unstimulated groups.

References

Bodenmiller, Bernd, Eli R Zunder, Rachel Finck, Tiffany J Chen, Erica S Savig, Robert V Bruggner, Erin F Simonds, et al. 2012. “Multiplexed Mass Cytometry Profiling of Cellular States Perturbed by Small-Molecule Regulators.” Nature Biotechnology 30 (9): 858–67.

Ritchie, Matthew E, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47.

Weber, Lukas M, Malgorzata Nowicka, Charlotte Soneson, and Mark D Robinson. 2019. “Diffcyt: Differential Discovery in High-Dimensional Cytometry via High-Resolution Clustering.” Communications Biology 2 (1): 183.

sessionInfo()

R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SuperCell_1.0     here_1.0.1        gridExtra_2.3     pheatmap_1.0.12  
[5] limma_3.54.1      ggplot2_3.4.1     data.table_1.14.8 workflowr_1.7.0  

loaded via a namespace (and not attached):
 [1] statmod_1.5.0      tidyselect_1.2.0   xfun_0.39          bslib_0.4.2       
 [5] splines_4.2.3      colorspace_2.1-0   vctrs_0.5.2        generics_0.1.3    
 [9] htmltools_0.5.4    yaml_2.3.7         utf8_1.2.3         rlang_1.0.6       
[13] jquerylib_0.1.4    later_1.3.0        pillar_1.8.1       glue_1.6.2        
[17] withr_2.5.0        RColorBrewer_1.1-3 lifecycle_1.0.3    stringr_1.5.0     
[21] munsell_0.5.0      gtable_0.3.1       evaluate_0.20      knitr_1.42        
[25] callr_3.7.3        fastmap_1.1.0      httpuv_1.6.9       ps_1.7.2          
[29] fansi_1.0.4        highr_0.10         Rcpp_1.0.10        promises_1.2.0.1  
[33] scales_1.2.1       cachem_1.0.6       jsonlite_1.8.4     farver_2.1.1      
[37] fs_1.6.1           digest_0.6.31      stringi_1.7.12     processx_3.8.0    
[41] dplyr_1.1.0        getPass_0.2-2      rprojroot_2.0.3    grid_4.2.3        
[45] cli_3.6.0          tools_4.2.3        magrittr_2.0.3     sass_0.4.5        
[49] tibble_3.1.8       whisker_0.4.1      pkgconfig_2.0.3    rmarkdown_2.20    
[53] httr_1.4.4         rstudioapi_0.14    R6_2.5.1           git2r_0.31.0      
[57] compiler_4.2.3

Recovery of Differentially Expressed Cell State Markers Across Stimulated and Unstimulated Human Peripheral Blood Cells

Givanna Putri

2023-05-24

Introduction

Load libraries and data

Examine purity of supercells

Differential Expression Analysis

References