Vignette built on May 31, 2019 with SCENIC version 1.1.2.1.
This tutorial goes through the steps in the SCENIC workflow:
Building the gene regulatory network (GRN):
Identify cell states and their regulators:
To start this tutorial you should have read the “Introduction and setup” vignette (vignette("SCENIC_Setup")
) and run the setup steps.
Overview of the main commands used to run the SCENIC workflow. To be used as cheatsheet or template (it is not exhaustive). They are explained in the following sections.
### Load data
loomPath <- system.file(package="SCENIC", "examples/mouseBrain_toy.loom")
library(SCopeLoomR)
loom <- open_loom(loomPath, mode="r")
exprMat <- get_dgem(loom)
cellInfo <- get_cellAnnotation(loom)
close_loom(loom)
### Initialize settings
library(SCENIC)
scenicOptions <- initializeScenic(org="mgi", dbDir="cisTarget_databases", nCores=10)
# scenicOptions@inputDatasetInfo$cellInfo <- "int/cellInfo.Rds"
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
### Co-expression network
genesKept <- geneFiltering(exprMat, scenicOptions)
exprMat_filtered <- exprMat[genesKept, ]
runCorrelation(exprMat_filtered, scenicOptions)
exprMat_filtered_log <- log2(exprMat_filtered+1)
runGenie3(exprMat_filtered_log, scenicOptions)
### Build and score the GRN
exprMat_log <- log2(exprMat+1)
scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"] # Toy run settings
runSCENIC_1_coexNetwork2modules(scenicOptions)
runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget")) # Toy run settings
runSCENIC_3_scoreCells(scenicOptions, exprMat_log)
# Export:
export2scope(scenicOptions, exprMat)
# Binarize activity?
# aucellApp <- plotTsne_AUCellApp(scenicOptions, exprMat_log)
# savedSelections <- shiny::runApp(aucellApp)
# newThresholds <- savedSelections$thresholds
# scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"
# saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))
# saveRDS(scenicOptions, file="int/scenicOptions.Rds")
runSCENIC_4_aucell_binarize(scenicOptions)
### Exploring output
# Check files in folder 'output'
# .loom file @ http://scope.aertslab.org
# output/Step2_MotifEnrichment_preview.html in detail/subset:
motifEnrichment_selfMotifs_wGenes <- loadInt(scenicOptions, "motifEnrichment_selfMotifs_wGenes")
tableSubset <- motifEnrichment_selfMotifs_wGenes[highlightedTFs=="Sox8"]
viewMotifs(tableSubset)
# output/Step2_regulonTargetsInfo.tsv in detail:
regulonTargetsInfo <- loadInt(scenicOptions, "regulonTargetsInfo")
tableSubset <- regulonTargetsInfo[TF=="Stat6" & highConfAnnot==TRUE]
viewMotifs(tableSubset)
During this workflow we will save multiple files. To keep them tidy, we recommend to set the working directory to a new folder.
For example:
dir.create("SCENIC_MouseBrain")
setwd("SCENIC_MouseBrain") # Or `knitr::opts_knit$set(root.dir = 'example_results/SCENIC_MouseBrain')` in the first chunk if running a notebook
By default, the intermediate files and plots will be saved into the int
folder, with a numbered prefix to keep them in order. You may use these files to check details about each step, or re-run parts of the analysis with different settings. The main output (i.e. plots, HTML reports,… ) will be saved into the output
folder.
The input for SCENIC is a single-cell RNA-seq expression matrix (with gene-symbol as rownames
, see the vignette("SCENIC_Setup")
for details). The first step is to load this matrix.
For this tutorial we provide a toy example only 200 cells and <1000 genes from the mouse brain (described in the setup vignette):
loomPath <- system.file(package="SCENIC", "examples/mouseBrain_toy.loom")
Open the loom file and load the expression matrix (and cell annotation if available)
library(SCopeLoomR)
loom <- open_loom(loomPath, mode="r")
exprMat <- get_dgem(loom)
cellInfo <- get_cellAnnotation(loom)
close_loom(loom)
dim(exprMat)
## [1] 862 200
In Step 3-4 (scoring the GRN and clustering), it is interesting to compare the results with known information about the cells. You can already indicate which variables will be to plot, and assign them a specific color (otherwise one will be assigned automatically, e.g. when colloring on on the heatmaps or t-SNE).
# cellInfo$nGene <- colSums(exprMat>0)
head(cellInfo)
## Class nGene nUMI
## 1772066100_D04 interneurons 170 509
## 1772063062_G01 oligodendrocytes 152 443
## 1772060224_F07 microglia 218 737
## 1772071035_G09 pyramidal CA1 265 1068
## 1772067066_E12 oligodendrocytes 81 273
## 1772066100_B01 pyramidal CA1 108 191
cellInfo <- data.frame(cellInfo)
cellTypeColumn <- "Class"
colnames(cellInfo)[which(colnames(cellInfo)==cellTypeColumn)] <- "CellType"
cbind(table(cellInfo$CellType))
## [,1]
## astrocytes_ependymal 29
## interneurons 43
## microglia 14
## oligodendrocytes 55
## pyramidal CA1 59
dir.create("int")
## Warning in dir.create("int"): 'int' already exists
saveRDS(cellInfo, file="int/cellInfo.Rds")
# Color to assign to the variables (same format as for NMF::aheatmap)
colVars <- list(CellType=c("microglia"="forestgreen",
"endothelial-mural"="darkorange",
"astrocytes_ependymal"="magenta4",
"oligodendrocytes"="hotpink",
"interneurons"="red3",
"pyramidal CA1"="skyblue",
"pyramidal SS"="darkblue"))
colVars$CellType <- colVars$CellType[intersect(names(colVars$CellType), cellInfo$CellType)]
saveRDS(colVars, file="int/colVars.Rds")
plot.new(); legend(0,1, fill=colVars$CellType, legend=names(colVars$CellType))
In order to keep consistent settings across the multiple steps of SCENIC, most functions in SCENIC package use a common object where the options for the current run are stored. This object replaces the “arguments” for most functions, and should be created at the begining of a SCENIC run with the function initializeScenic()
.
The default settings should be valid for most analyses. The parameters that need to be specified in all runs is the organism (mgi
for mouse, hgnc
for human, or dmel
for fly), and the directory where the RcisTarget databases are stored (you may create a link in the current directory to avoid duplicating them, e.g. in linux: system("ln -s ~/path/to/dbs databases")
).
For details on the options that can be modified check the help of ?initializeScenic
or of the specific function that takes it as input.
library(SCENIC)
org="mgi" # or hgnc, or dmel
dbDir="cisTarget_databases" # RcisTarget databases location
myDatasetTitle="SCENIC example on Mouse brain" # choose a name for your analysis
data(defaultDbNames)
dbs <- defaultDbNames[[org]]
scenicOptions <- initializeScenic(org=org, dbDir=dbDir, dbs=dbs, datasetTitle=myDatasetTitle, nCores=10)
## Motif databases selected:
## mm9-500bp-upstream-7species.mc9nr.feather
## mm9-tss-centered-10kb-7species.mc9nr.feather
# Modify if needed
scenicOptions@inputDatasetInfo$cellInfo <- "int/cellInfo.Rds"
scenicOptions@inputDatasetInfo$colVars <- "int/colVars.Rds"
# Databases:
# scenicOptions@settings$dbs <- c("mm9-5kb-mc8nr"="mm9-tss-centered-5kb-10species.mc8nr.feather")
# scenicOptions@settings$db_mcVersion <- "v8"
# Save to use at a later time...
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
The first step of the SCENIC workflow is to infer potential transcription factor targets based on the expression data. To do this we use GENIE3 or GRNBoost. The input to either of these tools are the expression matrix (filtered), and a list of transcription factors (potential regulators). The output of GENIE3/GRBBoost, and a correlation matrix will be used to create the co-expression modules (runSCENIC_1_coexNetwork2modules()
).
Choosing between GENIE3/GRNBoost: In principle, many of the existing methods to infer co-expression networks could be used for this step, as long as its output is a list of potential targets for each TF (and it can be applied to scRNA-seq…). We selected GENIE3 (Huynh-Thu et al. (2010)) because it allows to identify non-linear relationships, even if they are only present in a subset of samples, and it was the best performer in the Network Inference DREAM5 challenge (Marbach et al. (2012)). GENIE3 can easily be run within R.
However, GENIE3 is very time- and computationally-consuming (it will take several hours or days on datasets of 3-5k cells). To allow scalability to bigger datasets, we created GRNboost (see Aibar et al. (2017)) and the arboreto framework. GRNBoost provides similar results to GENIE3 in just a fraction of the time (publication in press), so we highly recommend it for bigger datasets.
Subsampling cells: When there is a high proportion of low-quality cells, or if the computation time is an issue, it is also possible to infer the regulatory network using a subset of cells (e.g. selecting random or high-quality cells as input to the co-expression analysis). The activity of the regulatory network, trained on this subset of cells, can then be evaluated on all the cells in the dataset with AUCell (Step 3). Note that to avoid loss of resolution, the subset of cells should be representative of the whole dataset (e.g. contain sufficient representation of all the cell types). Examples of this approach are presented in Aibar et al. (2017) (i.e. subsampling this mouse brain dataset, and the analysis of 49k cells from mouse retina).
To run GENIE3/GRNBoost we recommend to apply soft gene filter, to remove genes that are expressed either at very low levels or in too few cells. Here we apply a filtering based on the total number of counts of the gene, and the number of cells in which it is detected.
Filter by the total number of reads per gene. This filter is meant to remove genes that are most likely noise. By default it keeps only the genes with at least 6 UMI counts across all samples (e.g. the total number the gene would have, if it was expressed with a value of 3
in 1% of the cells). Adjust this value (minCountsPerGene
) according to the dataset (it will depend on the dataset units, e.g. UMI, TPMs…).
Filter by the number of cells in which the gene is detected** (e.g. >0 UMI, or >1 log2(TPM)). By default (minSamples
), genes that are detected in at least 1% of the cells are kept. This filtering is meant to remove genes whose reads come from one a few ‘noisy’ cells (genes that are only expressed in one, or very few cells, gain a lot of weight if they happen to coincide in a given cell). To avoid removing small (but potentially interesting) cell populations, we recommend to set a percentage lower than the smallest population of cells to be detected.
Finally, only the genes that are available in RcisTarget databases will be kept. This filter is mostly to save some running time for GENIE3/GRNBoost, since the genes that are not available in the databases will not be used in upcoming steps.
# (Adjust minimum values according to your dataset)
genesKept <- geneFiltering(exprMat, scenicOptions=scenicOptions,
minCountsPerGene=3*.01*ncol(exprMat),
minSamples=ncol(exprMat)*.01)
## Maximum value in the expression matrix: 421
## Ratio of detected vs non-detected: 0.24
## Number of counts (in the dataset units) per gene:
## Number of cells in which each gene is detected:
##
## Number of genes left after applying the following filters (sequential):
## 771 genes with counts per gene > 6
## 770 genes detected in more than 2 cells
## 770 genes available in RcisTarget database
## Gene list saved in int/1.1_genesKept.Rds
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 17.0 48.0 146.8 137.0 5868.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 25.00 39.01 60.00 180.00
Before proceeding to the network inference, check whether any known relevant genes are filtered-out (if any relevant gene is missing, double-check whether the filters are appropiate):
interestingGenes <- c("Sox9", "Sox10", "Dlx5")
# any missing?
interestingGenes[which(!interestingGenes %in% genesKept)]
## character(0)
We can now filter the expression matrix to contain only these 770 genes. This matrix is now ready for the co-expression analysis.
exprMat_filtered <- exprMat[genesKept, ]
dim(exprMat_filtered)
## [1] 770 200
exprMat
will not be used for the co-expression analysis, it can be unloaded:
rm(exprMat)
GENIE3/GRNBoost can detect both positive and negative associations. In order to distinguish potential activation from repression, we will split the targets into positive- and negative-correlated targets (i.e. Spearman correlation between the TF and the potential target).
(This step can be run either before/after or simultaneously to GENIE3/GRNBoost)
Calculate the correlation:
runCorrelation(exprMat_filtered, scenicOptions)
To run GRNBoost (in Python) instead of GENIE3. See ?exportsForGRNBoost
for details.
The input to GENIE3 is typically an expression matrix and a list of candidate regulators. The function runGenie3
will run GENIE3 with default settings, which are usually adequate for most datasets, using the transcription factors available in RcisTarget databases as candidate regulators.
Since GENIE3 is based on a Random Forest approach, each time it is run the results will be slightly different. The higher the number of trees used (ntrees
), the lower the variability. We recommend to use set.seed
to reproduce exact results in multiple runs. For more details, check ?GENIE3
(GENIE3 help) or ?runGenie3
(SCENIC wrapper for GENIE3).
GENIE3 will typically take several hours (or days) to run. If you are running this workflow on an RStudio session, we recommend that you stop here and run the next code chunk in an independent R console (i.e. with screen
/tmux
) or in an server/HPC (if available). The upcoming code chunks will resume the workflow by loading GENIE3 output.
## If launched in a new session, you will need to reload...
# setwd("...")
# loomPath <- "..."
# loom <- open_loom(loomPath, mode="r")
# exprMat <- get_dgem(loom)
# close_loom(loom)
# genesKept <- loadInt(scenicOptions, "genesKept")
# exprMat_filtered <- exprMat[genesKept,]
# library(SCENIC)
# scenicOptions <- readRDS("int/scenicOptions.Rds")
# Optional: add log (if it is not logged/normalized already)
exprMat_filtered <- log2(exprMat_filtered+1)
# Run GENIE3
runGenie3(exprMat_filtered, scenicOptions)
## Using 8 TFs as potential regulators...
## Warning in runGenie3(exprMat_filtered, scenicOptions): Only 0% of the 1721 TFs in the database were found in the dataset. Do they use the same gene IDs?
## Running GENIE3 part 1
## Running GENIE3 part 2
## Running GENIE3 part 3
## Running GENIE3 part 4
## Running GENIE3 part 5
## Running GENIE3 part 6
## Running GENIE3 part 7
## Running GENIE3 part 8
## Running GENIE3 part 9
## Running GENIE3 part 10
## Finished running GENIE3.
Once the results from GENIE3/GRNBoost (and the correlation) are ready, the remaining steps of SCENIC can be run.
The easiest/fastest way is to use the following wrapper functions, each of them corresponding to one of the main steps in SCENIC workflow:
Build the gene regulatory network: 1. Get co-expression modules 2. Get regulons (with RcisTarget): TF motif analysis)
Identify cell states: 3. Score GRN (regulons) in the cells (with AUCell) 4. Cluster cells according to the GRN activity
An overview of the steps workflow is explained in Aibar et al. (2017). Detailed tutorials/notebooks explaining these functions in detail are also available (see
vignette(package="SCENIC")
). These might be useful for users who want to know the details of the implementation, understand the results more in depth, or to modify or run only some of the steps of the workflow. We recommend to checkvignette("detailedStep_2_createRegulons", package="SCENIC")
to understand how the Gene Regulatory Networks are build. For more info on the scoring of the networks on the cells seevignette("detailedStep_3_scoreCells", package="SCENIC")
.
Re-load the expression matrix if necessary:
loom <- open_loom(loomPath, mode="r")
exprMat <- get_dgem(loom)
close_loom(loom)
# Optional: log expression (for TF expression plot, it does not affect any other calculation)
logMat <- log2(exprMat+1)
dim(exprMat)
## [1] 862 200
Run the remaining steps using the wrapper functions:
library(SCENIC)
scenicOptions <- readRDS("int/scenicOptions.Rds")
scenicOptions@settings$verbose <- TRUE
scenicOptions@settings$nCores <- 10
scenicOptions@settings$seed <- 123
# For a very quick run:
# coexMethod=c("top5perTarget")
scenicOptions@settings$dbs <- scenicOptions@settings$dbs["10kb"] # For toy run
# save...
runSCENIC_1_coexNetwork2modules(scenicOptions)
runSCENIC_2_createRegulons(scenicOptions, coexMethod=c("top5perTarget")) #** Only for toy run!!
runSCENIC_3_scoreCells(scenicOptions, logMat)
** For a quick run, using the ‘toy dataset’, you may set runSCENIC_2_createRegulons(scenicOptions, coexMethod="top5perTarget")
and only one of the RcisTarget databases. These results will not be as comprehensive as the full run, but might be enough for getting a feel of the interface.
Building the GRN and scoring its activity in AUCell is often enough for datasets with very clear cell types. However, in many cases it is also useful to binarize the activity score into “on/off”; either for easier interpretation, or for maximizing the differences across cell types. It is possible to binarize only specific regulons for exploring/interpreting key TFs. However, binarizing the activity of all the regulons in the dataset allows to create the “Binarized regulon activity matrix”, which can be used for upstream analysis (e.g. clustering). The binarized activity is specially useful to reduce technical biases (e.g. number of detected genes, batch effects), the grouping by patient of origin in cancer datasets, or even cross-species comparisons (see Aibar et al. (2017)).
To determine in which cells each regulon is active, we will use an AUC threshold. AUCell automatically calculates possible thresholds for the binarization, but they are often too conservative. We recommend to check these thresholds manually before proceeding to the binarization. This can be a iterative process, where the thresholds can be re-adjusted after an initial exploration. Once the final thresholds are selected, the cell-regulon activity will be summarized into a binary activity matrix in which the columns represent the cells and the rows the regulons. The coordinates of the matrix that correspond to active regulons in a given cell will contain a “1” value, and “0” all the others.
You can see the selected thresholds in the output from the previous step [file: output/Step3_3.2_AUCtSNEs.html
(If you are using Rstudio, you might need to download the file and accompanying folder)], and these can be adjusted with AUCell’s Shiny app:
aucellApp <- plotTsne_AUCellApp(scenicOptions, logMat)
savedSelections <- shiny::runApp(aucellApp)
# Save the modified thresholds:
newThresholds <- savedSelections$thresholds
scenicOptions@fileNames$int["aucell_thresholds",1] <- "int/newThresholds.Rds"
saveRDS(newThresholds, file=getIntName(scenicOptions, "aucell_thresholds"))
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
Once you have optimized the thresholds, run runSCENIC_4_aucell_binarize
to binarize the AUC, and generate some extra figures and clusterings:
# scenicOptions@settings$devType="png"
runSCENIC_4_aucell_binarize(scenicOptions)
The t-SNEs can also be created using the binary activity matrix (in the same way as indicated in section “Creating/comparing t-SNEs”), just set tsneAUC( ..., aucType="binary")
instead.
The cells can be grouped/clustered based on the regulon activity, either continuous or binarized (See the section below “Exploring > Cell states” for details).
If using t-SNE as visualization, it is recommended to try different settings to evaluate the stability of the states/clusters. Feel free to use UMAP, other clustering methods (or trajectory inference methods, if appropriate) instead.
The function included in SCENIC package runs multiple t-SNEs with different settings; It will create all combinations between the selected “number of PCs” and “perplexity” (expected running time: few minutes to hours, depending on the number of cells):
nPcs <- c(5) # For toy dataset
# nPcs <- c(5,15,50)
scenicOptions@settings$seed <- 123 # same seed for all of them
# Run t-SNE with different settings:
fileNames <- tsneAUC(scenicOptions, aucType="AUC", nPcs=nPcs, perpl=c(5,15,50))
fileNames <- tsneAUC(scenicOptions, aucType="AUC", nPcs=nPcs, perpl=c(5,15,50), onlyHighConf=TRUE, filePrefix="int/tSNE_oHC")
# Plot as pdf (individual files in int/):
fileNames <- paste0("int/",grep(".Rds", grep("tSNE_", list.files("int"), value=T), value=T))
Note: The toy dataset only contains ~8 regulons; using more than 8 PCs will not provide any difference…
and to view/compare them…
par(mfrow=c(length(nPcs), 3))
fileNames <- paste0("int/",grep(".Rds", grep("tSNE_AUC", list.files("int"), value=T, perl = T), value=T))
plotTsne_compareSettings(fileNames, scenicOptions, showLegend=FALSE, varName="CellType", cex=.5)
# Using only "high-confidence" regulons (normally similar)
par(mfrow=c(3,3))
fileNames <- paste0("int/",grep(".Rds", grep("tSNE_oHC_AUC", list.files("int"), value=T, perl = T), value=T))
plotTsne_compareSettings(fileNames, scenicOptions, showLegend=FALSE, varName="CellType", cex=.5)
The chosen t-SNE can then be saved as default to use for plots (can also be “binary”, see below):
scenicOptions@settings$defaultTsne$aucType <- "AUC"
scenicOptions@settings$defaultTsne$dims <- 5
scenicOptions@settings$defaultTsne$perpl <- 15
saveRDS(scenicOptions, file="int/scenicOptions.Rds")
The results from SCENIC can also be explored in http://scope.aertslab.org (Davie et al. (2018)).
The .loom
file can be created with the function export2scope()
(requires the package SCopeLoomR
). This function saves the the main results from SCENIC into a .loom file:
Regulons
Regulon activity (AUC matrix and thresholds)
Embeddings (e.g. t-SNE and UMAP on the regulon activity)
The motif enrichment analysis and co-expression modules (e.g. GRNBoost/GENIE3 output) are stored in independent text files (mostly due to their bigger size).
# DGEM (Digital gene expression matrix)
# (non-normalized counts)
# exprMat <- get_dgem(open_loom(loomPath))
# dgem <- exprMat
# head(colnames(dgem)) #should contain the Cell ID/name
# Export:
scenicOptions@fileNames$output["loomFile",] <- "output/mouseBrain_SCENIC.loom"
export2scope(scenicOptions, exprMat)
To add extra data (e.g. embeddings or clusters), see help(package="SCopeLoomR")
.
SCopeLoomR
also provides functions to import the regulons, AUC, and embeddings from the loom file. e.g.:
library(SCopeLoomR)
scenicLoomPath <- getOutName(scenicOptions, "loomFile")
loom <- open_loom(scenicLoomPath)
# Read information from loom file:
regulons_incidMat <- get_regulons(loom)
regulons <- regulonsToGeneLists(regulons_incidMat)
regulonsAUC <- get_regulonsAuc(loom)
regulonsAucThresholds <- get_regulonThresholds(loom)
embeddings <- get_embeddings(loom)
The output
folder contains several files that provide an overview of the results from each step. These results can be explored in more detail through the intermediate files (saved in the int
folder, which can be listed with loadInt(scenicOptions)
).
Some examples on how to explore the results:
AUCell provides the activity of the regulons across the cells. By clustering the cells based on this regulon activity (either the continuous or binary AUC matrix), we can see whether there are groups of cells that tend to have the same regulons active, and reveal the network states that are recurrent across multiple cells. These states would be equivalent to the attractor states of the network. Combining these clustering with different visualization methods, we can explore the association of cell states with specific regulons.
SCENIC provides some wrapper functions to get a quick overview. For example, projecting the AUC and TF expression onto t-SNEs, and visualizing of the AUC as heatmaps, but feel free to explore alternative clustering and visualization tools.
Briefly, a t-SNE is a 2D projection of the cells, where cells (dots) are placed close to each other if they have similar input profiles (in our case, regulon activity). The t-SNE usually allows to get a quick and easy overview of the cell states in the dataset. Note however, that t-SNE works well to identify distinct classes, but it is not appropiate for dinamic/continuous processes (e.g. trajectory-like visualizations).
AUCell’s interactive app (for SCope, see section “Export to loom/SCope”):
logMat <- exprMat # Better if it is logged/normalized
aucellApp <- plotTsne_AUCellApp(scenicOptions, logMat) # default t-SNE
savedSelections <- shiny::runApp(aucellApp)
AUCell_plotTSNE()
to save static plots:
print(tsneFileName(scenicOptions))
## [1] "int/tSNE_AUC_05pcs_15perpl.Rds"
tSNE_scenic <- readRDS(tsneFileName(scenicOptions))
aucell_regulonAUC <- loadInt(scenicOptions, "aucell_regulonAUC")
# Show TF expression:
par(mfrow=c(2,3))
AUCell::AUCell_plotTSNE(tSNE_scenic$Y, exprMat, aucell_regulonAUC[onlyNonDuplicatedExtended(rownames(aucell_regulonAUC))[c("Dlx5", "Sox10", "Sox9","Irf1", "Stat6")],], plots="Expression")
# Save AUC as PDF:
Cairo::CairoPDF("output/Step4_BinaryRegulonActivity_tSNE_colByAUC.pdf", width=20, height=15)
par(mfrow=c(4,6))
AUCell::AUCell_plotTSNE(tSNE_scenic$Y, cellsAUC=aucell_regulonAUC, plots="AUC")
dev.off()
Density plot to detect most likely stable states (higher-density areas in the t-SNE):
library(KernSmooth)
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
library(RColorBrewer)
dens2d <- bkde2D(tSNE_scenic$Y, 1)$fhat
image(dens2d, col=brewer.pal(9, "YlOrBr"), axes=FALSE)
contour(dens2d, add=TRUE, nlevels=5, drawlabels=FALSE)
Show several regulons simultaneously:
#par(bg = "black")
par(mfrow=c(1,2))
regulonNames <- c( "Dlx5","Sox10")
cellCol <- plotTsne_rgb(scenicOptions, regulonNames, aucType="AUC", aucMaxContrast=0.6)
text(0, 10, attr(cellCol,"red"), col="red", cex=.7, pos=4)
text(-20,-10, attr(cellCol,"green"), col="green3", cex=.7, pos=4)
regulonNames <- list(red=c("Sox10", "Sox8"),
green=c("Irf1"),
blue=c( "Tef"))
cellCol <- plotTsne_rgb(scenicOptions, regulonNames, aucType="Binary")
text(5, 15, attr(cellCol,"red"), col="red", cex=.7, pos=4)
text(5, 15-4, attr(cellCol,"green"), col="green3", cex=.7, pos=4)
text(5, 15-8, attr(cellCol,"blue"), col="blue", cex=.7, pos=4)
Genes included in the regulons:
regulons <- loadInt(scenicOptions, "regulons")
regulons[c("Dlx5", "Irf1")]
## $Dlx5
## [1] "2610203C20Rik" "Adamts17" "AI854703" "Arhgef10l" "Bahcc1" "Cirbp" "Cplx1" "Dlx1" "Gad1" "Gadd45gip1" "Hexim2" "Igf1"
## [13] "Iglon5" "Ltbp3" "Myt1" "Npas1" "Nxph1" "Peli2" "Plekha6" "Prkab1" "Ptchd2" "Racgap1" "Rgs8" "Robo1"
## [25] "Rpl34" "Sema3c" "Shisa9" "Slc12a5" "Slc39a6" "Spns2" "Stox2" "Syt6" "Unc5d" "Wnt5a"
##
## $Irf1
## [1] "4930523C07Rik" "Acsl5" "Adrb1" "Ahdc1" "Ahnak" "Akap1" "Anxa2" "Arhgap31" "Arhgef10l" "Arhgef19" "Atg14" "B2m"
## [13] "Bank1" "Bcl2a1b" "Cald1" "Capsl" "Ccl2" "Ccl3" "Ccl7" "Ccnd1" "Ccnd3" "Cd163" "Cd48" "Cited2"
## [25] "Cmtm6" "Ctgf" "Ctnnd1" "Ctss" "Ddr2" "E130114P18Rik" "Egfr" "Ehd1" "Esam" "Fcer1g" "Fcgr1" "Fgf14"
## [37] "Fstl4" "Fyb" "Gabrb1" "Gadd45g" "Gja1" "Gpr34" "Gramd3" "H2-K1" "Hfe" "Il6ra" "Irf1" "Itgb5"
## [49] "Kcne4" "Kcnj16" "Kcnj2" "Laptm5" "Lhfp" "Map3k5" "Mdk" "Med13" "Midn" "Mr1" "Ms4a7" "Msr1"
## [61] "Myh9" "Nin" "Nptx1" "Osbpl6" "P2ry13" "Parp12" "Peli2" "Plcl2" "Plekha6" "Prkab1" "Rab3il1" "Rgs5"
## [73] "Rhobtb2" "Rnase4" "Sepp1" "Sertad2" "Sgk3" "Slamf9" "Slc4a4" "Slc7a7" "Slco5a1" "Slitrk2" "Soat1" "Srgn"
## [85] "St3gal6" "St5" "St8sia4" "Stard8" "Stat6" "Stk17b" "Tapbp" "Tbxas1" "Tgfa" "Tgfb3" "Tmem100" "Tnfaip3"
## [97] "Tnfaip8" "Trib1" "Trpm7" "Txnip" "Usp2" "Vgll4" "Vps37b" "Zfp36l2" "Zfp703"
Note than only regulons with 10 genes or more are scored with AUCell:
regulons <- loadInt(scenicOptions, "aucell_regulons")
head(cbind(onlyNonDuplicatedExtended(names(regulons))))
## [,1]
## Tef "Tef (405g)"
## Dlx5 "Dlx5 (35g)"
## Sox9 "Sox9 (150g)"
## Sox8 "Sox8 (97g)"
## Sox10 "Sox10 (88g)"
## Irf1 "Irf1 (105g)"
Details on the TF-target links: For each TF-target pair, the stats from the intermediate steps are summarized in loadInt(scenicOptions, "regulonTargetsInfo")
(saved as text in: getOutName(scenicOptions, "s2_regulonTargetsInfo")
: output/Step2_regulonTargetsInfo.tsv). This table can be used to explore the support to specific links. Since it will typically contain several thousand rows (in this run: 1277), in most cases it is advisable to subset it before exporting it as HTML.
regulonTargetsInfo <- loadInt(scenicOptions, "regulonTargetsInfo")
tableSubset <- regulonTargetsInfo[TF=="Stat6" & highConfAnnot==TRUE]
viewMotifs(tableSubset)
The full list of TF motifs supporting the regulons can be seen in the restuls from RcisTarget motif enrichment results (for the co-expression modules). These are saved in motifEnrichment_selfMotifs_wGenes
. A preview of these results is exported as html in output/Step2_MotifEnrichment_preview.html (and as text in: output/Step2_MotifEnrichment.tsv).
Alternative tables, showing more or fewer rows/columns could be generated modifiying this code:
motifEnrichment_selfMotifs_wGenes <- loadInt(scenicOptions, "motifEnrichment_selfMotifs_wGenes")
tableSubset <- motifEnrichment_selfMotifs_wGenes[highlightedTFs=="Dlx5"]
viewMotifs(tableSubset)
The regulatory analysis from SCENIC can be combined with other analyses, for example clustering, or focus on regulators for specific cell types. There are multiple options to do these analyses (your imagination is the limit!). Here are some quick examples to start:
(Clusters could also be used instead of “cell types”, e.g. with Seurat: cellInfo <- data.frame(seuratCluster=Idents(seuratObject))
)
regulonAUC <- loadInt(scenicOptions, "aucell_regulonAUC")
regulonAUC <- regulonAUC[onlyNonDuplicatedExtended(rownames(regulonAUC)),]
regulonActivity_byCellType <- sapply(split(rownames(cellInfo), cellInfo$CellType),
function(cells) rowMeans(getAUC(regulonAUC)[,cells]))
regulonActivity_byCellType_Scaled <- t(scale(t(regulonActivity_byCellType), center = T, scale=T))
pheatmap::pheatmap(regulonActivity_byCellType_Scaled, #fontsize_row=3,
color=colorRampPalette(c("blue","white","red"))(100), breaks=seq(-3, 3, length.out = 100),
treeheight_row=10, treeheight_col=10, border_color=NA)
# filename="regulonActivity_byCellType.pdf", width=10, height=20)
topRegulators <- reshape2::melt(regulonActivity_byCellType_Scaled)
colnames(topRegulators) <- c("Regulon", "CellType", "RelativeActivity")
topRegulators <- topRegulators[which(topRegulators$RelativeActivity>0),]
viewTable(topRegulators)
minPerc <- .7
binaryRegulonActivity <- loadInt(scenicOptions, "aucell_binary_nonDupl")
cellInfo_binarizedCells <- cellInfo[which(rownames(cellInfo)%in% colnames(binaryRegulonActivity)),, drop=FALSE]
regulonActivity_byCellType_Binarized <- sapply(split(rownames(cellInfo_binarizedCells), cellInfo_binarizedCells$CellType),
function(cells) rowMeans(binaryRegulonActivity[,cells, drop=FALSE]))
binaryActPerc_subset <- regulonActivity_byCellType_Binarized[which(rowSums(regulonActivity_byCellType_Binarized>minPerc)>0),]
pheatmap::pheatmap(binaryActPerc_subset, # fontsize_row=5,
color = colorRampPalette(c("white","pink","red"))(100), breaks=seq(0, 1, length.out = 100),
treeheight_row=10, treeheight_col=10, border_color=NA)
#filename="regulonActivityBinary_byCellType.pdf", width=10, height=20)
topRegulators <- reshape2::melt(regulonActivity_byCellType_Binarized)
colnames(topRegulators) <- c("Regulon", "CellType", "RelativeActivity")
topRegulators <- topRegulators[which(topRegulators$RelativeActivity>minPerc),]
viewTable(topRegulators)
library(Seurat)
dr_coords <- Embeddings(seuratObject, reduction="tsne")
tfs <- c("Sox10","Irf1","Sox9", "Dlx5")
par(mfrow=c(2,2))
AUCell::AUCell_plotTSNE(dr_coords, cellsAUC=selectRegulons(regulonAUC, tfs), plots = "AUC")
close_loom(loom)
date()
## [1] "Fri May 31 15:18:43 2019"
sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux ComputeNode release 6.5 (Santiago)
##
## Matrix products: default
## BLAS: /ddn1/vol1/staging/leuven/stg_00002/lcb/dwmax/software/R/3.5.1-X11/lib64/R/lib/libRblas.so
## LAPACK: /ddn1/vol1/staging/leuven/stg_00002/lcb/dwmax/software/R/3.5.1-X11/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8
## [8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 KernSmooth_2.23-15 doRNG_1.7.1 rngtools_1.3.1 pkgmaker_0.27 registry_0.5-1 foreach_1.4.4 SCopeLoomR_0.4.0 RcisTarget_1.3.6
## [10] AUCell_1.7.2 SCENIC_1.1.2-01 BiocParallel_1.14.2
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 matrixStats_0.54.0 bit64_0.9-7 doParallel_1.0.14 GENIE3_1.4.3 rprojroot_1.3-2 GenomeInfoDb_1.16.0
## [8] tools_3.5.1 backports_1.1.3 R6_2.4.0 DT_0.5 colorspace_1.4-1 DBI_1.0.0 BiocGenerics_0.28.0
## [15] withr_2.1.2 bit_1.1-14 compiler_3.5.1 graph_1.58.2 Biobase_2.40.0 hdf5r_1.1.1 DelayedArray_0.8.0
## [22] scales_1.0.0 stringr_1.4.0 digest_0.6.18 rmarkdown_1.10 R.utils_2.8.0 XVector_0.20.0 pkgconfig_2.0.2
## [29] htmltools_0.3.6 bibtex_0.4.2 htmlwidgets_1.3 rlang_0.3.4 RSQLite_2.1.1 shiny_1.3.0 jsonlite_1.6
## [36] crosstalk_1.0.0 R.oo_1.22.0 RCurl_1.95-4.12 magrittr_1.5 feather_0.3.3 GenomeInfoDbData_1.1.0 Matrix_1.2-14
## [43] munsell_0.5.0 Rcpp_1.0.1 S4Vectors_0.20.1 R.methodsS3_1.7.1 stringi_1.4.3 yaml_2.2.0.9999 SummarizedExperiment_1.10.1
## [50] zlibbioc_1.26.0 plyr_1.8.4 grid_3.5.1 blob_1.1.1 parallel_3.5.1 promises_1.0.1 crayon_1.3.4
## [57] lattice_0.20-35 annotate_1.58.0 hms_0.4.2 knitr_1.22 pillar_1.3.1 GenomicRanges_1.34.0 reshape2_1.4.3
## [64] codetools_0.2-15 stats4_3.5.1 XML_3.98-1.19 evaluate_0.13 data.table_1.12.2 httpuv_1.5.1 gtable_0.3.0
## [71] xfun_0.6 mime_0.6 xtable_1.8-3 later_0.8.0 SingleCellExperiment_1.2.0 tibble_2.1.1 pheatmap_1.0.12
## [78] iterators_1.0.10 AnnotationDbi_1.44.0 memoise_1.1.0 IRanges_2.16.0 GSEABase_1.42.0
Aibar, Sara, Carmen Bravo González-Blas, Thomas Moerman, Jasper Wouters, Vân Anh Huynh-Thu, Hana Imrichová, Zeynep Kalender Atak, et al. 2017. “SCENIC: Single-Cell Regulatory Network Inference and Clustering.” Nature Methods 14 (october): 1083–6. doi:10.1038/nmeth.4463.
Davie, Kristofer, Jasper Janssens, Duygu Koldere, and “et al.” 2018. “A Single-Cell Transcriptome Atlas of the Aging Drosophila Brain.” Cell, june. doi:10.1016/j.cell.2018.05.057.
Huynh-Thu, Vân Anh, Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. 2010. “Inferring Regulatory Networks from Expression Data Using Tree-Based Methods.” PloS One 5 (9). doi:10.1371/journal.pone.0012776.
Marbach, Daniel, James C. Costello, Robert Küffner, Nicole M. Vega, Robert J. Prill, Diogo M. Camacho, Kyle R. Allison, et al. 2012. “Wisdom of Crowds for Robust Gene Network Inference.” Nature Methods 9 (8): 796–804. doi:10.1038/nmeth.2016.