Contents

Introduction to SCENIC

SCENIC is a tool to simultaneously reconstruct gene regulatory networks and identify stable cell states from single-cell RNA-seq data. The gene regulatory network is inferred based on co-expression and DNA motif analysis, and then the network activity is analyzed in each cell to identify the recurrent cellular states.

More info & citation

The approach behind SCENIC and its application to several datasets (e.g. usage examples) was presented in the following article:

Aibar et al. (2017) SCENIC: single-cell regulatory network inference and clustering. Nature Methods. doi: 10.1038/nmeth.4463.

Please, cite this article if you use SCENIC in your research.

Requirements

Species

The current version of SCENIC supports human, mouse and fly (Drosophila melanogaster).

To apply SCENIC to other species, it would require manual adjustments on the second step (e.g. new RcisTarget databases or using a diferent motif-enrichment-analysis tool, we are currently working on providing tutorials for doing this).

Input: expression matrix

The input to SCENIC is the single-cell RNA-seq expression matrix:

  • Each column corresponds to a sample (cell) and each row corresponds to a gene.

  • The gene ID should be the gene-symbol and stored as rownames (for compatibility with RcisTarget annotation databases).

  • Expression units: The preferred expression values are gene-summarized counts. There is currently not a strong recommendation towards using the raw counts, or counts normalized through single-cell specific methods (e.g. Seurat). Other measurements, such as transcripts/counts per million (TPM) and FPKM/RPKM, are also accepted as input. However, note that some authors recommend avoiding within sample normalization (i.e. TPM) for co-expression analysis (first step of SCENIC) because they may induce artificial co-variation (Crow et al. (2016)). The choice of input expression matrix might have some effect on the co-expression analysis to create the regulons (step 1). The other steps of the workflow are not directly affected by the input expression values: (2) The expression is not taken into account for the motif analysis, and (3) AUCell, which is used for scoring the regulons on the cells, is cell ranking-based (it works as an implicit normalization). Overall, SCENIC is quite robust to this choice, we have applied SCENIC to datasets using raw (logged) UMI counts, normalized UMI counts, and TPM and they all provided reliable results (see Aibar et al. (2017)).

Availability (R / Python)

SCENIC is implemented in R (this package and tutorial) and Python (pySCENIC).

The Python implementation is significantly faster to run, so we generally recommend using it for most analyses. We provide containers (in Docker and Singularity) including all the required dependencies, and a Nextflow pipeline (VSN, useful to run SCENIC in batch on multiple datasets). This should make it very easy to install and run (py)SCENIC, even for users with limited experience in Python.

The results are equivalent across versions and provide output .loom files that can be explored in SCope or used as interface between R and Python.

The rest of this tutorial will continue with the R implementation, but the general concepts behind SCENIC algorithm also apply to the other versions.

Installation

The R implementation of SCENIC is based on three R packages:

  1. GENIE3 to infer the co-expression network (faster alternative: GRNBoost2)

  2. RcisTarget for the analysis of transcription factor binding motifs

  3. AUCell to identify cells with active gene sets (gene-network) in scRNA-seq data

Therefore, you will need to install these packages, and some extra dependencies, to run SCENIC:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::version()
# If your bioconductor version is previous to 4.0, see the section bellow

## Required
BiocManager::install(c("AUCell", "RcisTarget"))
BiocManager::install(c("GENIE3")) # Optional. Can be replaced by GRNBoost

## Optional (but highly recommended):
# To score the network on cells (i.e. run AUCell):
BiocManager::install(c("zoo", "mixtools", "rbokeh"))
# For various visualizations and perform t-SNEs:
BiocManager::install(c("DT", "NMF", "ComplexHeatmap", "R2HTML", "Rtsne"))
# To support paralell execution (not available in Windows):
BiocManager::install(c("doMC", "doRNG"))
# To export/visualize in http://scope.aertslab.org
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("aertslab/SCopeLoomR", build_vignettes = TRUE)

You are now ready to install SCENIC:

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("aertslab/SCENIC") 
packageVersion("SCENIC")

For older versions of R/Bioconductor

For older versions of R/Bioconductor, you might need to install a previous version of SCENIC. e.g.:

# For R version 3.6 and Bioconductor 3.9:  
devtools::install_github("aertslab/SCENIC@v1.1.2")

In case you have trouble installing the packages from Bioconductor (for example, for older versions of R), you may try installing them from Github or directly from the Bioconductor package files:

# Github:
devtools::install_github("aertslab/AUCell")
devtools::install_github("aertslab/RcisTarget")
devtools::install_github("aertslab/GENIE3")

# Bioconductor
# install.packages("https://bioconductor.org/packages/release/bioc/src/contrib/PACKAGENAME.tar.gz", repos=NULL)

Species-specific databases

In addition to the R-packages, you will also need to download the species-specific databases for RcisTarget (the motif rankings). The links to all the available databases are available in our website. By default, SCENIC uses the databases that score the motifs in the promoter of the genes (up to 500bp upstream the TSS), and in the 20kb around the TSS (+/-10kbp).

For human:

dbFiles <- c("https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-500bp-upstream-7species.mc9nr.feather",
"https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg19/refseq_r45/mc9nr/gene_based/hg19-tss-centered-10kb-7species.mc9nr.feather")
# mc9nr: Motif collection version 9: 24k motifs

For mouse:

dbFiles <- c("https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-500bp-upstream-7species.mc9nr.feather",
"https://resources.aertslab.org/cistarget/databases/mus_musculus/mm9/refseq_r45/mc9nr/gene_based/mm9-tss-centered-10kb-7species.mc9nr.feather")
# mc9nr: Motif collection version 9: 24k motifs

For fly:

dbFiles <- c("https://resources.aertslab.org/cistarget/databases/drosophila_melanogaster/dm6/flybase_r6.02/mc8nr/gene_based/dm6-5kb-upstream-full-tx-11species.mc8nr.feather")
# mc8nr: Motif collection version 8: 20k motifs

The approximate file size for these databases (.feather files) is 1GB. To avoid problems or incomplete downloads (especially if you have a slow connection), we recommend to use zsync_curl.

If you prefer to download directly from R, you can try the following code:*

# dir.create("cisTarget_databases"); setwd("cisTarget_databases") # if needed
for(featherURL in dbFiles)
{
  download.file(featherURL, destfile=basename(featherURL)) # saved in current dir
}

To confirm that the databases were downloaded correctly, we recommend to confirm its sha256sum: https://resources.aertslab.org/cistarget/databases/sha256sum.txt

After these setup steps, SCENIC is ready to run! To start, see the vignette “SCENIC_Running”.

Some tips…

Template for your own analysis

You can use the R notebooks of this workflow as template for your own data (i.e. copy the .Rmd file, and edit it in RStudio).

vignetteFile <- file.path(system.file('doc', package='SCENIC'), "SCENIC_Running.Rmd")
file.copy(vignetteFile, "SCENIC_myRun.Rmd")
# or: 
vignetteFile <- "https://raw.githubusercontent.com/aertslab/SCENIC/master/vignettes/SCENIC_Running.Rmd"
download.file(vignetteFile, "SCENIC_myRun.Rmd")

Note that some steps of this workflow can take considerable time. To avoid re-running these steps when knitting the vignette (i.e. create the HTML report), we have added eval=FALSE to some code chunks and load() its output in the next. Feel free to adapt these to your needs.

Help

At any time, you an access the help for any function used in this workflow (i.e. for details on their arguments), and the vignettes of the other steps of the workflow with the following commands:

## Get help for a function:
?runSCENIC_3_scoreCells
help(runSCENIC_3_scoreCells) # equivalent

## See the available tutorials:
vignette(package="SCENIC") # list
vignette("SCENIC_Running") # open

Sample dataset

Running SCENIC in a real dataset typically takes a few hours. The toy example used in these tutorials (200 cells and <1000 genes) is a subset of a dataset of 3005 cells from the adult mouse brain, including neurons (e.g. pyramidal neurons and interneurons) and glia (oligodendrocytes, astrocytes/ependymal, endothelial/mural and microglia). The expression values are Unique Molecular Identifier counts.

Zeisel, A., et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142. doi: 10.1126/science.aaa1934

The output files from the run on the full dataset are available at http://scenic.aertslab.org/examples/

Formatting input

Reminder: SCENIC’s main input is a single-cell expression matrix with genes-symbols as row names.

Below there are several options to download and format this dataset (as example to format/import your own data).

a) From .loom file

.loom files can be directly imported into SCENIC through SCopeLoomR.

e.g. for the 3005 mouse brain cells dataset:

## Download:
download.file("http://loom.linnarssonlab.org/clone/Previously%20Published/Cortex.loom", "Cortex.loom")
loomPath <- "Cortex.loom"

To load the expression matrix and cell annotation:

library(SCopeLoomR)
loom <- open_loom(loomPath, mode="r+") # We recommend to open files as read-only (mode="r"). However, since this is an old-format loom file, it needs write access to update it.
exprMat <- get_dgem(loom)
cellInfo <- get_cell_annotation(loom)
close_loom(loom)

b) From 10X/CellRanger output files

10X/CellRanger output matrices can be used as input for SCENIC. The tutorials on how to load the CellRanger output into R are available at 10X website (choose the appropriate CellRanger version): https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices#r-load-mat

Some packages, such as Seurat, also provide functions to directly import 10X/CellRanger output:

singleCellMatrix <- Seurat::Read10X(data.dir="../data/pbmc3k/filtered_gene_bc_matrices/hg19/")

c) From other R objects (e.g. Seurat, SingleCellExperiment)

Many R packages store the expression data in their own data structures (e.g. Seurat) or Bioconductor classes (e.g. SingleCellExperiment SummarizedExperiment, ExpressionSet).

Most of these objects have data acessors to retrieve the expression matrix and cell metadata (the function name depends on the object type or package).

For example, for a SingleCellExperiment object the accessors are counts() and colData():

library(SingleCellExperiment)
## Get data from sce object:
exprMat <- counts(sce)
cellInfo <- colData(sce)

To use Seurat clusters as cell annotation (e.g. for visualization):

cellInfo <- data.frame(seuratCluster=Idents(seuratObject))

d) From GEO

Example downloading and formatting a dataset from GEO (e.g. the 3005 mouse brain cells dataset is available as GEO accession number GSE60361):

# dir.create("SCENIC_MouseBrain"); setwd("SCENIC_MouseBrain") # if needed

# (This may take a few minutes)
if (!requireNamespace("GEOquery", quietly = TRUE)) BiocManager::install("GEOquery")
library(GEOquery)
geoFile <- getGEOSuppFiles("GSE60361", makeDirectory=FALSE)
gzFile <- grep("Expression", basename(rownames(geoFile)), value=TRUE)
txtFile <- gsub(".gz", "", gzFile)
gunzip(gzFile, destname=txtFile, remove=TRUE)

library(data.table)
geoData <- fread(txtFile, sep="\t")
geneNames <- unname(unlist(geoData[,1, with=FALSE]))
exprMatrix <- as.matrix(geoData[,-1, with=FALSE])
rm(geoData)
dim(exprMatrix)
rownames(exprMatrix) <- geneNames
exprMatrix <- exprMatrix[unique(rownames(exprMatrix)),]
exprMatrix[1:5,1:4]

# Remove file downloaded:
file.remove(txtFile)

The cell types are available at the author’s website. For simplicity, here we will load them from AUCell package:

cellLabels <- paste(file.path(system.file('examples', package='AUCell')), "mouseBrain_cellLabels.tsv", sep="/")
cellLabels <- read.table(cellLabels, row.names=1, header=TRUE, sep="\t")
cellInfo <- as.data.frame(cellLabels)
colnames(cellInfo) <- "CellType"

Saving into loom

To run SCENIC in Python we recommend to save/export the expression matrix and cell metadata into a .loom file (for R, any R/Bioconductor object is also OK):

# setwd("SCENIC_MouseBrain")
dir.create("data")
loom <- build_loom("data/mouseBrain.loom", dgem=exprMatrix)
loom <- add_cell_annotation(loom, cellInfo)
close_loom(loom)

References

Aibar, Sara, Carmen Bravo González-Blas, Thomas Moerman, Jasper Wouters, Vân Anh Huynh-Thu, Hana Imrichová, Zeynep Kalender Atak, et al. 2017. “SCENIC: Single-Cell Regulatory Network Inference and Clustering.” Nature Methods 14: 1083–6. https://doi.org/10.1038/nmeth.4463.

Crow, Megan, Anirban Paul, Sara Ballouz, Z. Josh Huang, and Jesse Gillis. 2016. “Exploiting Single-Cell Expression to Characterize Co-Expression Replicability.” Genome Biology 17 (May): 101. https://doi.org/10.1186/s13059-016-0964-6.