library(miloR)
library(SingleCellExperiment)
library(scater)
library(scran)
library(dplyr)
library(patchwork)
library(MouseGastrulationData)

1 Load data

For this vignette we will use the mouse gastrulation single-cell data from Pijuan-Sala et al. 2019. The dataset can be downloaded as a SingleCellExperiment object from the MouseGastrulationData package on Bioconductor. To make computations faster, here we will download just a subset of samples, 4 samples at stage E7 and 4 samples at stage E7.5.

This dataset has already been pre-processed and contains a pca.corrected dimensionality reduction, which was built after batch correction using fastMNN.

select_samples <- c(2,  3,  6, 15,
                    # 4, 19, 
                    10, 14, 20, 30
                    #31, 32
                    )
embryo_data = EmbryoAtlasData(samples = select_samples)
embryo_data
## class: SingleCellExperiment 
## dim: 29452 14679 
## metadata(0):
## assays(1): counts
## rownames(29452): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ENSEMBL SYMBOL
## colnames(14679): cell_361 cell_362 ... cell_107079 cell_107080
## colData names(17): cell barcode ... colour sizeFactor
## reducedDimNames(2): pca.corrected umap
## altExpNames(0):

2 Visualize the data

We recompute the UMAP embedding for this subset of cells to visualize the data.

embryo_data <- embryo_data[,apply(reducedDim(embryo_data, "pca.corrected"), 1, function(x) !all(is.na(x)))]
embryo_data <- runUMAP(embryo_data, dimred = "pca.corrected", name = 'umap')

plotReducedDim(embryo_data, colour_by="stage", dimred = "umap") 

We will test for significant differences in abundance of cells between these stages of development, and the associated gene signatures.

3 Differential abundance testing

3.1 Create a Milo object

For differential abundance analysis on graph neighbourhoods we first construct a Milo object. This extends the SingleCellExperiment class to store information about neighbourhoods on the KNN graph.

embryo_milo <- Milo(embryo_data)
embryo_milo
## class: Milo 
## dim: 29452 13002 
## metadata(0):
## assays(1): counts
## rownames(29452): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ENSEMBL SYMBOL
## colnames(13002): cell_361 cell_362 ... cell_107079 cell_107080
## colData names(17): cell barcode ... colour sizeFactor
## reducedDimNames(2): pca.corrected umap
## altExpNames(0):
## nhoods dimensions(2): 1 1
## nhoodCounts dimensions(2): 1 1
## nhoodDistances dimension(1): 0
## graph names(0):
## nhoodIndex names(1): 0
## nhoodExpression dimension(2): 1 1
## nhoodReducedDim names(0):
## nhoodGraph names(0):
## nhoodAdjacency dimension(2): 1 1

3.2 Construct KNN graph

We need to add the KNN graph to the Milo object. This is stored in the graph slot, in igraph format. The miloR package includes functionality to build and store the graph from the PCA dimensions stored in the reducedDim slot. In this case, we specify that we want to build the graph from the MNN corrected PCA dimensions.

For graph building you need to define a few parameters:

  • d: the number of reduced dimensions to use for KNN refinement. We recommend using the same \(d\) used for KNN graph building, or to select PCs by inspecting the scree plot.
  • k: this affects the power of DA testing, since we need to have enough cells from each sample represented in a neighbourhood to estimate the variance between replicates. On the other side, increasing \(k\) too much might lead to over-smoothing. We suggest to start by using the same value for \(k\) used for KNN graph building for clustering and UMAP visualization. We will later use some heuristics to evaluate whether the value of \(k\) should be increased.
embryo_milo <- buildGraph(embryo_milo, k = 30, d = 30, reduced.dim = "pca.corrected")

Alternatively, one can add a precomputed KNN graph (for example constructed with Seurat or scanpy) to the graph slot using the adjacency matrix, through the helper function buildFromAdjacency.

3.3 Defining representative neighbourhoods on the KNN graph

We define the neighbourhood of a cell, the index, as the group of cells connected by an edge in the KNN graph to the index cell. For efficiency, we don’t test for DA in the neighbourhood of every cell, but we sample as indices a subset of representative cells, using a KNN sampling algorithm used by Gut et al. 2015.

As well as \(d\) and \(k\), for sampling we need to define a few additional parameters:

  • prop: the proportion of cells to randomly sample to start with. We suggest using prop=0.1 for datasets of less than 30k cells. For bigger datasets using prop=0.05 should be sufficient (and makes computation faster).
  • refined: indicates whether you want to use the sampling refinement algorith, or just pick cells at random. The default and recommended way to go is to use refinement. The only situation in which you might consider using random instead, is if you have batch corrected your data with a graph based correction algorithm, such as BBKNN, but the results of DA testing will be suboptimal.
embryo_milo <- makeNhoods(embryo_milo, prop = 0.1, k = 30, d=30, refined = TRUE, reduced_dims = "pca.corrected")

Once we have defined neighbourhoods, we plot the distribution of neighbourhood sizes (i.e. how many cells form each neighbourhood) to evaluate whether the value of \(k\) used for graph building was appropriate. We can check this out using the plotNhoodSizeHist function.

As a rule of thumb we want to have an average neighbourhood size over 5 x N_samples. If the mean is lower, or if the distribution is

plotNhoodSizeHist(embryo_milo)

3.4 Counting cells in neighbourhoods

Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata and specify which column contains the sample information.

embryo_milo <- countCells(embryo_milo, meta.data = as.data.frame(colData(embryo_milo)), sample="sample")

This adds to the Milo object a \(n \times m\) matrix, where \(n\) is the number of neighbourhoods and \(m\) is the number of experimental samples. Values indicate the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing.

head(nhoodCounts(embryo_milo))
## 6 x 8 sparse Matrix of class "dgCMatrix"
##   2  3  6 15 10 14 20 30
## 1 1  1 13 11 26 12  5 33
## 2 3  4 24  1  .  . 11  .
## 3 1 20 32  .  1  . 27  1
## 4 1  . 27 16 32 13  5 23
## 5 5  6 84  4  5  1 40 10
## 6 1  . 42 10  6  5 10  9

3.5 Defining experimental design

Now we are all set to test for differential abundance in neighbourhoods. We implement this hypothesis testing in a generalized linear model (GLM) framework, specifically using the Negative Binomial GLM implementation in edgeR.

We first need to think about our experimental design. The design matrix should match each sample to the experimental condition of interest for DA testing. In this case, we want to detect DA between embryonic stages, stored in the stage column of the dataset colData. We also include the sequencing.batch column in the design matrix. This represents a known technical covariate that we want to account for in DA testing.

embryo_design <- data.frame(colData(embryo_milo))[,c("sample", "stage", "sequencing.batch")]

## Convert batch info from integer to factor
embryo_design$sequencing.batch <- as.factor(embryo_design$sequencing.batch) 
embryo_design <- distinct(embryo_design)
rownames(embryo_design) <- embryo_design$sample

embryo_design
##    sample stage sequencing.batch
## 2       2  E7.5                1
## 3       3  E7.5                1
## 6       6  E7.5                1
## 15     15  E7.0                2
## 10     10  E7.0                1
## 14     14  E7.0                2
## 20     20  E7.5                2
## 30     30  E7.0                3

3.6 Computing neighbourhood connectivity

Milo uses an adaptation of the Spatial FDR correction introduced by cydar, where we correct p-values accounting for the amount of overlap between neighbourhoods. Specifically, each hypothesis test P-value is weighted by the reciprocal of the kth nearest neighbour distance. To use this statistic we first need to store the distances between nearest neighbors in the Milo object. This is done by the calcNhoodDistance function (N.B. this step is the most time consuming of the analysis workflow and might take a couple of minutes for large datasets).

embryo_milo <- calcNhoodDistance(embryo_milo, d=30, reduced.dim = "pca.corrected")

3.7 Testing

Now we can do the DA test, explicitly defining our experimental design. In this case, we want to test for differences between experimental stages, while accounting for the variability between technical batches (You can find more info on how to use formulas to define a testing design in R here)

da_results <- testNhoods(embryo_milo, design = ~ sequencing.batch + stage, design.df = embryo_design)
## Warning in if (colnames(nhoodCounts(x)) != rownames(model)) {: the condition has
## length > 1 and only the first element will be used
head(da_results)
##        logFC   logCPM         F      PValue        FDR Nhood SpatialFDR
## 1 -1.8019342 20.07089 1.5747061 0.209611035 0.29435477     1 0.29385683
## 2  4.7439141 19.12468 7.0132433 0.008128239 0.03458566     2 0.03452842
## 3  5.4795871 19.87780 8.6993418 0.003204932 0.02281712     3 0.02248097
## 4 -1.6502089 20.24398 1.2210504 0.269232708 0.36142976     4 0.36054817
## 5  3.2520711 20.61980 3.6364192 0.056612494 0.11608972     5 0.11588354
## 6  0.8336509 19.82476 0.2813193 0.595872418 0.68340624     6 0.68348253

This calculates a Fold-change and corrected P-value for each neighbourhood, which indicates wheather there is significant differential abundance between developmental stages. The main statistics we consider here are:

  • logFC: indicates the log-Fold change in cell numbers between samples from E7.5 and samples from E7.0
  • PValue: reports P-values before FDR correction
  • SpatialFDR: reports P-values corrected for multiple testing accounting for overlap between neighbourhoods
da_results %>%
  arrange(SpatialFDR) %>%
  head() 
##         logFC   logCPM        F       PValue         FDR Nhood  SpatialFDR
## 17  -6.823950 19.52586 12.87253 0.0003381350 0.009215665    17 0.008646251
## 47  -7.291677 19.69498 14.41699 0.0001490295 0.009215665    47 0.008646251
## 73   7.481586 19.66873 13.11078 0.0002978813 0.009215665    73 0.008646251
## 88  -7.590249 19.85551 13.11803 0.0002967336 0.009215665    88 0.008646251
## 136  7.288474 19.63873 12.71007 0.0003686914 0.009215665   136 0.008646251
## 142 -7.125816 19.65072 14.21796 0.0001655750 0.009215665   142 0.008646251

4 Inspecting DA testing results

We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots. We first inspect the distribution of uncorrected P values, to verify that the test was balanced.

ggplot(da_results, aes(PValue)) + geom_histogram(bins=50)

Then we visualize the test results with a volcano plot (remember that each point here represents a neighbourhood, not a cell).

ggplot(da_results, aes(logFC, -log10(SpatialFDR))) + 
  geom_point() +
  geom_hline(yintercept = 1) ## Mark significance threshold (10% FDR)

Looks like we have detected several neighbourhoods were there is a significant difference in cell abundances between developmental stages.

To visualize DA results relating them to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. Here the layout of nodes is determined by the position of the index cell in the UMAP embedding of all single-cells. The neighbourhoods displaying singificant DA are colored by their log-Fold Change.

embryo_milo <- buildNhoodGraph(embryo_milo)

## Plot single-cell UMAP
umap_pl <- plotReducedDim(embryo_milo, dimred = "umap", colour_by="stage", text_by = "celltype", 
                          text_size = 3, point_size=0.5) +
  guides(fill="none")

## Plot neighbourhood graph
nh_graph_pl <- plotNhoodGraphDA(embryo_milo, da_results, layout="umap",alpha=0.1) 
  
umap_pl + nh_graph_pl +
  plot_layout(guides="collect")

We might also be interested in visualizing wheather DA is particularly evident in certain cell types. To do this, we assign a cell type label to each neighbourhood by finding the most abundant cell type within cells in each neighbourhood. We can label neighbourhoods in the results data.frame using the function annotateNhoods. This also saves the fraction of cells harbouring the label.

da_results <- annotateNhoods(embryo_milo, da_results, coldata_col = "celltype")
head(da_results)
##        logFC   logCPM         F      PValue        FDR Nhood SpatialFDR
## 1 -1.8019342 20.07089 1.5747061 0.209611035 0.29435477     1 0.29385683
## 2  4.7439141 19.12468 7.0132433 0.008128239 0.03458566     2 0.03452842
## 3  5.4795871 19.87780 8.6993418 0.003204932 0.02281712     3 0.02248097
## 4 -1.6502089 20.24398 1.2210504 0.269232708 0.36142976     4 0.36054817
## 5  3.2520711 20.61980 3.6364192 0.056612494 0.11608972     5 0.11588354
## 6  0.8336509 19.82476 0.2813193 0.595872418 0.68340624     6 0.68348253
##               celltype celltype_fraction
## 1         ExE endoderm         1.0000000
## 2 Rostral neurectoderm         0.5581395
## 3         ExE endoderm         1.0000000
## 4         ExE ectoderm         1.0000000
## 5         ExE ectoderm         1.0000000
## 6         ExE ectoderm         1.0000000

While neighbourhoods tend to be homogeneous, we can define a threshold for celltype_fraction to exclude neighbourhoods that are a mix of cell types.

ggplot(da_results, aes(celltype_fraction)) + geom_histogram(bins=50)

da_results$celltype <- ifelse(da_results$celltype_fraction < 0.7, "Mixed", da_results$celltype)

Now we can visualize the distribution of DA Fold Changes in different cell types

plotDAbeeswarm(da_results, group.by = "celltype")

This is already quite informative: we can see that certain early development cell types, such as epiblast and primitive streak, are enriched in the earliest time stage, while others are enriched later in development, such as ectoderm cells. Interestingly, we also see plenty of DA neighbourhood with a mixed label. This could indicate that transitional states show changes in abundance in time.

5 Finding markers of DA populations

Once you have found your neighbourhoods showindg significant DA between conditions, you might want to find gene signatures specific to the cells in those neighbourhoods. The function findNhoodGroupMarkers runs a one-VS-all differential gene expression test to identify marker genes for a group of neighbourhoods of interest. Before running this function you will need to define your neighbourhood groups depending on your biological question, that need to be stored as a NhoodGroup column in the da_results data.frame.

5.0.1 Custom grouping

In a case where all the DA neighbourhoods seem to belong to the same region of the graph, you might just want to test the significant DA neighbourhoods with the same logFC against all the rest (N.B. for illustration purposes, here I am testing on a randomly selected set of 10 genes).

## Add log normalized count to Milo object
embryo_milo <- logNormCounts(embryo_milo)

da_results$NhoodGroup <- as.numeric(da_results$SpatialFDR < 0.1 & da_results$logFC < 0)
da_nhood_markers <- findNhoodGroupMarkers(embryo_milo, da_results, subset.row = rownames(embryo_milo)[1:10])
## Warning: Zero sample variances detected, have been offset away from zero

## Warning: Zero sample variances detected, have been offset away from zero
head(da_nhood_markers)
##                          logFC_0  adj.P.Val_0       logFC_1  adj.P.Val_1
## ENSMUSG00000025902  2.419208e-01 8.777861e-55  0.1796267114 3.415641e-18
## ENSMUSG00000033813 -1.248384e-01 1.008030e-21  0.1473799333 3.709619e-12
## ENSMUSG00000025903 -2.688184e-02 1.067889e-01 -0.0992316258 2.021336e-04
## ENSMUSG00000033845 -1.877176e-02 4.098436e-01 -0.0548703559 2.774132e-02
## ENSMUSG00000102343  8.066721e-05 1.000000e+00 -0.0004663117 1.000000e+00
## ENSMUSG00000104217  2.243763e-04 1.000000e+00 -0.0001441491 1.000000e+00
##                                GeneID
## ENSMUSG00000025902 ENSMUSG00000025903
## ENSMUSG00000033813 ENSMUSG00000033813
## ENSMUSG00000025903 ENSMUSG00000025902
## ENSMUSG00000033845 ENSMUSG00000033845
## ENSMUSG00000102343 ENSMUSG00000104217
## ENSMUSG00000104217 ENSMUSG00000025900

For this analysis we recommend aggregating the neighbourhood expression profiles by experimental samples (the same used for DA testing), by setting aggregate.samples=TRUE. This way single-cells will not be considered as “replicates” during DGE testing, and dispersion will be estimated between true biological replicates. Like so:

da_nhood_markers <- findNhoodGroupMarkers(embryo_milo, da_results, subset.row = rownames(embryo_milo)[1:10], 
                                          aggregate.samples = TRUE, sample_col = "sample")
## Warning: Zero sample variances detected, have been offset away from zero

## Warning: Zero sample variances detected, have been offset away from zero
head(da_nhood_markers)
##                          logFC_0 adj.P.Val_0       logFC_1 adj.P.Val_1
## ENSMUSG00000025902  2.953881e-01   0.4036165 -0.1986203567           1
## ENSMUSG00000051951 -6.489327e-03   0.8616749 -0.1229491134           1
## ENSMUSG00000033813 -1.007680e-01   0.8616749 -0.0004523552           1
## ENSMUSG00000104217  4.243912e-04   0.8616749 -0.0003297304           1
## ENSMUSG00000025900  3.468931e-04   0.8616749 -0.0029750363           1
## ENSMUSG00000102343  3.188358e-05   0.8616749 -0.0131102447           1
##                                GeneID
## ENSMUSG00000025902 ENSMUSG00000025902
## ENSMUSG00000051951 ENSMUSG00000033813
## ENSMUSG00000033813 ENSMUSG00000104217
## ENSMUSG00000104217 ENSMUSG00000025900
## ENSMUSG00000025900 ENSMUSG00000051951
## ENSMUSG00000102343 ENSMUSG00000025903

(Notice the difference in p values)

5.1 Automatic grouping of neighbourhoods

In many cases, such as this example, DA neighbourhoods are found in different areas of the KNN graph, and grouping together all significant DA populations might not be ideal, as they might include cells of very different celltypes. For this kind of scenario, we have implemented a neighbourhood function that uses community detection to partition neighbourhoods into groups on the basis of (1) the number of shared cells between 2 neighbourhoods; (2) the direction of fold-change for DA neighbourhoods; (3) the difference in fold change.

## Run buildNhoodGraph to store nhood adjacency matrix
embryo_milo <- buildNhoodGraph(embryo_milo)

## Find groups
da_results <- groupNhoods(embryo_milo, da_results, max.lfc.delta = 2)
head(da_results)
##        logFC   logCPM         F      PValue        FDR Nhood SpatialFDR
## 1 -1.8019342 20.07089 1.5747061 0.209611035 0.29435477     1 0.29385683
## 2  4.7439141 19.12468 7.0132433 0.008128239 0.03458566     2 0.03452842
## 3  5.4795871 19.87780 8.6993418 0.003204932 0.02281712     3 0.02248097
## 4 -1.6502089 20.24398 1.2210504 0.269232708 0.36142976     4 0.36054817
## 5  3.2520711 20.61980 3.6364192 0.056612494 0.11608972     5 0.11588354
## 6  0.8336509 19.82476 0.2813193 0.595872418 0.68340624     6 0.68348253
##       celltype celltype_fraction NhoodGroup
## 1 ExE endoderm         1.0000000          2
## 2        Mixed         0.5581395         10
## 3 ExE endoderm         1.0000000          2
## 4 ExE ectoderm         1.0000000          7
## 5 ExE ectoderm         1.0000000          7
## 6 ExE ectoderm         1.0000000          7

Let’s have a look at the detected groups

plotNhoodGroups(embryo_milo, da_results, layout="umap") 

plotDAbeeswarm(da_results, "NhoodGroup")

We can see that some groups are made up of single neighbourhoods.

Let’s check how changing the grouping parameters changes the groups we obtain, starting with the LFC delta…

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 0.5), group.by = "NhoodGroup") + ggtitle("max LFC delta=0.5")

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 1) , group.by = "NhoodGroup") + ggtitle("max LFC delta=1")

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 2)   , group.by = "NhoodGroup") + ggtitle("max LFC delta=2")

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 3)   , group.by = "NhoodGroup") + ggtitle("max LFC delta=3")

…and minimum overlap between neighbourhoods…

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 3, overlap=1), group.by = "NhoodGroup") + ggtitle("overlap=1")

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 3, overlap=5) , group.by = "NhoodGroup") + ggtitle("overlap=5")

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 3, overlap=10)   , group.by = "NhoodGroup") + ggtitle("overlap=10")

plotDAbeeswarm(groupNhoods(embryo_milo, da_results, max.lfc.delta = 3, overlap=20)   , group.by = "NhoodGroup") + ggtitle("overlap=20")

Let’s settle for overlap=5 and max.lfc.delta=3

set.seed(42)
da_results <- groupNhoods(embryo_milo, da_results, max.lfc.delta = 3, overlap=5)
plotNhoodGroups(embryo_milo, da_results, layout="umap")

plotDAbeeswarm(da_results, group.by = "NhoodGroup")

5.2 Finding gene signatures for neighbourhoods

Once we have grouped neighbourhoods using groupNhoods we are now all set to identifying gene signatures between neighbourhood groups.

Let’s restrict the testing to highly variable genes in this case

## Exclude zero counts genes
keep.rows <- rowSums(logcounts(embryo_milo)) != 0
embryo_milo <- embryo_milo[keep.rows, ]

## Find HVGs
dec <- modelGeneVar(embryo_milo)
hvgs <- getTopHVGs(dec, n=2000)
head(hvgs)
## [1] "ENSMUSG00000095180" "ENSMUSG00000032083" "ENSMUSG00000061808"
## [4] "ENSMUSG00000002985" "ENSMUSG00000010760" "ENSMUSG00000024990"

We run findNhoodGroupMarkers to test for one-vs-all differential gene expression for each neighbourhood group

nhood_markers <- findNhoodGroupMarkers(embryo_milo, da_results, subset.row = hvgs, 
                                       aggregate.samples = TRUE, sample_col = "sample")

head(nhood_markers)
##                     logFC_10 adj.P.Val_10  logFC_12 adj.P.Val_12  logFC_11
## ENSMUSG00000043501 2.9646465 1.562878e-57 0.7691716 4.553362e-12 1.5858477
## ENSMUSG00000054422 1.4701245 3.847251e-57 0.3125880 2.240048e-07 1.6693077
## ENSMUSG00000028435 1.8249161 6.008323e-50 0.5799020 6.540830e-07 0.6492358
## ENSMUSG00000028307 2.1875224 3.299047e-41 1.0306744 6.540830e-07 2.5430717
## ENSMUSG00000031883 2.7333457 4.601569e-41 0.5710806 6.540830e-07 2.3425176
## ENSMUSG00000002992 0.9654865 4.601569e-41 0.3108632 3.349925e-05 2.3459158
##                    adj.P.Val_11  logFC_15 adj.P.Val_15  logFC_13 adj.P.Val_13
## ENSMUSG00000043501 4.924399e-40 2.7414106 3.649100e-55 0.5929410 2.963965e-15
## ENSMUSG00000054422 7.870376e-32 0.7751874 3.137399e-51 1.1436480 2.326333e-14
## ENSMUSG00000028435 9.946030e-31 3.3267822 2.228196e-35 0.7823292 2.326333e-14
## ENSMUSG00000028307 2.468419e-30 1.6718590 3.166183e-34 0.7689160 7.883357e-13
## ENSMUSG00000031883 9.668985e-27 0.8270745 3.338418e-24 1.2940178 7.883357e-13
## ENSMUSG00000002992 1.112635e-24 0.4688946 1.626136e-21 0.9092562 1.924512e-12
##                      logFC_4  adj.P.Val_4   logFC_8  adj.P.Val_8   logFC_7
## ENSMUSG00000043501 1.3937635 3.594276e-12 0.8356951 1.779965e-09 2.1814056
## ENSMUSG00000054422 0.6063110 2.194208e-06 0.8060153 3.130049e-08 0.5805507
## ENSMUSG00000028435 0.6798852 1.183039e-05 0.6241943 6.125129e-08 0.1715966
## ENSMUSG00000028307 0.7938846 1.471772e-04 0.4258766 6.160555e-08 0.8388038
## ENSMUSG00000031883 1.6592479 2.681678e-04 1.8805559 6.160555e-08 1.9510637
## ENSMUSG00000002992 0.4938781 4.159978e-04 0.3897134 6.160555e-08 0.4347922
##                     adj.P.Val_7   logFC_6  adj.P.Val_6  logFC_1  adj.P.Val_1
## ENSMUSG00000043501 2.477879e-09 1.4745482 1.328073e-47 3.787710 9.968649e-84
## ENSMUSG00000054422 7.600250e-07 0.4553959 3.619561e-36 4.076044 2.927961e-83
## ENSMUSG00000028435 7.944728e-07 0.7582259 1.945637e-27 1.869432 5.394850e-63
## ENSMUSG00000028307 1.430430e-05 0.4364792 3.662407e-19 3.774259 7.341302e-58
## ENSMUSG00000031883 1.186845e-04 3.2920842 2.163825e-18 1.598884 2.238933e-57
## ENSMUSG00000002992 2.387520e-04 0.8327464 3.477533e-16 4.088254 1.600801e-54
##                      logFC_5  adj.P.Val_5   logFC_9  adj.P.Val_9   logFC_2
## ENSMUSG00000043501 3.0517156 5.701994e-25 0.9072267 1.893202e-17 0.5799201
## ENSMUSG00000054422 1.3293330 2.120583e-13 1.4132221 5.437680e-17 2.2259883
## ENSMUSG00000028435 1.7193466 6.669500e-13 0.9072541 7.926351e-15 2.4754683
## ENSMUSG00000028307 0.8911453 1.718046e-12 0.7819596 3.182970e-14 1.8041112
## ENSMUSG00000031883 1.0648664 1.181642e-11 0.8805972 7.630000e-14 0.3507645
## ENSMUSG00000002992 3.1886923 2.423750e-11 1.3891633 2.984042e-13 0.4000897
##                     adj.P.Val_2  logFC_14 adj.P.Val_14  logFC_3  adj.P.Val_3
## ENSMUSG00000043501 9.159978e-42 3.1678372 7.212461e-69 2.220615 5.330687e-29
## ENSMUSG00000054422 2.615762e-37 1.6016211 5.516361e-65 1.577827 2.890586e-22
## ENSMUSG00000028435 1.066192e-32 2.0085002 4.396394e-36 2.146765 8.474129e-22
## ENSMUSG00000028307 2.971374e-30 1.3122062 6.678507e-34 1.600663 4.956831e-19
## ENSMUSG00000031883 2.150877e-26 0.7504527 1.923460e-32 2.007259 2.023994e-17
## ENSMUSG00000002992 8.499734e-24 0.8508131 1.625929e-31 1.669676 3.183487e-16
##                                GeneID
## ENSMUSG00000043501 ENSMUSG00000040111
## ENSMUSG00000054422 ENSMUSG00000027217
## ENSMUSG00000028435 ENSMUSG00000072770
## ENSMUSG00000028307 ENSMUSG00000043068
## ENSMUSG00000031883 ENSMUSG00000038366
## ENSMUSG00000002992 ENSMUSG00000032741

Let’s check out the markers for group 2 for example

gr2_markers <- nhood_markers[c("logFC_2", "adj.P.Val_2")] 
colnames(gr2_markers) <- c("logFC", "adj.P.Val")

head(gr2_markers[order(gr2_markers$adj.P.Val), ])
##                        logFC    adj.P.Val
## ENSMUSG00000043501 0.5799201 9.159978e-42
## ENSMUSG00000054422 2.2259883 2.615762e-37
## ENSMUSG00000028435 2.4754683 1.066192e-32
## ENSMUSG00000028307 1.8041112 2.971374e-30
## ENSMUSG00000031883 0.3507645 2.150877e-26
## ENSMUSG00000002992 0.4000897 8.499734e-24

If you already know you are interested only in the markers for group 2, you might want to test just 8-VS-all using the subset.groups parameter:

nhood_markers <- findNhoodGroupMarkers(embryo_milo, da_results, subset.row = hvgs, 
                                       aggregate.samples = TRUE, sample_col = "sample",
                                       subset.groups = c("2")
                                       )

head(nhood_markers)
##                      logFC_2  adj.P.Val_2             GeneID
## ENSMUSG00000028011 0.5799201 9.159978e-42 ENSMUSG00000028011
## ENSMUSG00000046550 2.2259883 2.615762e-37 ENSMUSG00000046550
## ENSMUSG00000018217 2.4754683 1.066192e-32 ENSMUSG00000018217
## ENSMUSG00000042812 1.8041112 2.971374e-30 ENSMUSG00000042812
## ENSMUSG00000000690 0.3507645 2.150877e-26 ENSMUSG00000000690
## ENSMUSG00000042448 0.4000897 8.499734e-24 ENSMUSG00000042448

You might also want to compare a subset of neighbourhoods between each other. You can specify the neighbourhoods to use for testing by setting the parameter subset.nhoods.

For example, you might want to compare just one pair of neighbourhood groups against each other:

nhood_markers <- findNhoodGroupMarkers(embryo_milo, da_results, subset.row = hvgs,
                                       subset.nhoods = da_results$NhoodGroup %in% c('11','2'),
                                       aggregate.samples = TRUE, sample_col = "sample")

head(nhood_markers)
##                     logFC_11 adj.P.Val_11   logFC_2  adj.P.Val_2
## ENSMUSG00000095180  5.068741 4.862626e-16 -5.068741 4.862626e-16
## ENSMUSG00000028773  4.016353 1.447469e-15 -4.016353 1.447469e-15
## ENSMUSG00000031765  3.376249 3.874846e-15 -3.376249 3.874846e-15
## ENSMUSG00000047751  3.362640 7.910282e-15 -3.362640 7.910282e-15
## ENSMUSG00000000184 -2.525007 3.996510e-14  2.525007 3.996510e-14
## ENSMUSG00000028832 -2.615901 3.996510e-14  2.615901 3.996510e-14
##                                GeneID
## ENSMUSG00000095180 ENSMUSG00000095180
## ENSMUSG00000028773 ENSMUSG00000028773
## ENSMUSG00000031765 ENSMUSG00000031765
## ENSMUSG00000047751 ENSMUSG00000047751
## ENSMUSG00000000184 ENSMUSG00000000184
## ENSMUSG00000028832 ENSMUSG00000028832

or you might use subset.nhoods to exclude singleton neighbourhoods, or to subset to the neighbourhoods that show significant DA.

5.3 Visualize detected markers

Lets select marker genes for group 2 at FDR 1% and log-fold-Change > 1.

ggplot(nhood_markers, aes(logFC_2,-log10(adj.P.Val_2 ))) + 
  geom_point(alpha=0.5, size=0.5) +
  geom_hline(yintercept = 2)

markers <- rownames(nhood_markers)[nhood_markers$adj.P.Val_2 < 0.01 & nhood_markers$logFC_2 > 0]

We can visualize the expression in neighbourhoods using plotNhoodExpressionGroups.

plotNhoodExpressionGroups(embryo_milo, da_results, features=markers,
                          subset.nhoods = da_results$NhoodGroup %in% c('11','2'), scale=TRUE,
                          grid.space = "fixed")
## Warning in plotNhoodExpressionGroups(embryo_milo, da_results, features =
## markers, : Nothing in nhoodExpression(x): computing for requested features...

5.4 DGE testing within a group

In some cases you might want to test for differential expression between cells in different conditions within the same neighbourhood group. You can do that using testDiffExp:

dge_9 <- testDiffExp(embryo_milo, da_results, design = ~ stage, meta.data = data.frame(colData(embryo_milo)),
                     subset.row = rownames(embryo_milo)[1:10], subset.nhoods=da_results$NhoodGroup=="9")
## Warning: Zero sample variances detected, have been offset away from zero
dge_9
## $`9`
##                          logFC    AveExpr          t     P.Value  adj.P.Val
## ENSMUSG00000025903 -0.53840267 0.61175858 -2.8612745 0.004938901 0.04938901
## ENSMUSG00000033793 -0.19057035 0.32202361 -1.3132813 0.191465897 0.95732948
## ENSMUSG00000033813  0.16899924 0.74153490  0.8317037 0.407143832 1.00000000
## ENSMUSG00000025902 -0.01279485 0.01219552 -0.3223327 0.747732928 1.00000000
## ENSMUSG00000033845  0.07238215 2.13673807  0.3033422 0.762127559 1.00000000
## ENSMUSG00000051951  0.00000000 0.00000000  0.0000000 1.000000000 1.00000000
## ENSMUSG00000102343  0.00000000 0.00000000  0.0000000 1.000000000 1.00000000
## ENSMUSG00000025900  0.00000000 0.00000000  0.0000000 1.000000000 1.00000000
## ENSMUSG00000104217  0.00000000 0.00000000  0.0000000 1.000000000 1.00000000
## ENSMUSG00000002459  0.00000000 0.00000000  0.0000000 1.000000000 1.00000000
##                             B Nhood.Group
## ENSMUSG00000025903  -7.942018           9
## ENSMUSG00000033793 -11.076487           9
## ENSMUSG00000033813 -11.592135           9
## ENSMUSG00000025902 -11.887447           9
## ENSMUSG00000033845 -11.893431           9
## ENSMUSG00000051951 -11.939786           9
## ENSMUSG00000102343 -11.939786           9
## ENSMUSG00000025900 -11.939786           9
## ENSMUSG00000104217 -11.939786           9
## ENSMUSG00000002459 -11.939786           9
Session Info
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
##  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
##  [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
##  [7] LC_PAPER=en_US.utf8       LC_NAME=C                
##  [9] LC_ADDRESS=C              LC_TELEPHONE=C           
## [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scran_1.18.5                MouseGastrulationData_1.4.0
##  [3] patchwork_1.1.1             dplyr_1.0.4                
##  [5] scater_1.18.5               ggplot2_3.3.3              
##  [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
##  [9] DelayedArray_0.16.1         MatrixGenerics_1.2.1       
## [11] Matrix_1.3-2                BiocParallel_1.24.1        
## [13] matrixStats_0.58.0          Biobase_2.50.0             
## [15] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
## [17] IRanges_2.24.1              S4Vectors_0.28.1           
## [19] BiocGenerics_0.36.0         miloR_0.99.0               
## [21] edgeR_3.32.1                limma_3.46.0               
## [23] BiocStyle_2.18.1           
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0              colorspace_2.0-0             
##   [3] ellipsis_0.3.1                scuttle_1.0.4                
##   [5] bluster_1.0.0                 XVector_0.30.0               
##   [7] BiocNeighbors_1.8.2           farver_2.0.3                 
##   [9] graphlayouts_0.7.1            ggrepel_0.9.1                
##  [11] bit64_4.0.5                   AnnotationDbi_1.52.0         
##  [13] interactiveDisplayBase_1.28.0 splines_4.0.4                
##  [15] codetools_0.2-18              sparseMatrixStats_1.2.1      
##  [17] cachem_1.0.4                  knitr_1.31                   
##  [19] polyclip_1.10-0               dbplyr_2.1.0                 
##  [21] ggforce_0.3.2                 shiny_1.6.0                  
##  [23] BiocManager_1.30.10           compiler_4.0.4               
##  [25] httr_1.4.2                    dqrng_0.2.1                  
##  [27] assertthat_0.2.1              fastmap_1.1.0                
##  [29] later_1.1.0.1                 tweenr_1.0.1                 
##  [31] BiocSingular_1.6.0            htmltools_0.5.1.1            
##  [33] tools_4.0.4                   rsvd_1.0.3                   
##  [35] igraph_1.2.6                  gtable_0.3.0                 
##  [37] glue_1.4.2                    GenomeInfoDbData_1.2.4       
##  [39] rappdirs_0.3.3                Rcpp_1.0.6                   
##  [41] vctrs_0.3.6                   ExperimentHub_1.16.0         
##  [43] DelayedMatrixStats_1.12.3     ggraph_2.0.5                 
##  [45] xfun_0.21                     stringr_1.4.0                
##  [47] beachmat_2.6.4                mime_0.10                    
##  [49] lifecycle_1.0.0               irlba_2.3.3                  
##  [51] gtools_3.8.2                  statmod_1.4.35               
##  [53] AnnotationHub_2.22.0          zlibbioc_1.36.0              
##  [55] MASS_7.3-53.1                 scales_1.1.1                 
##  [57] tidygraph_1.2.0               promises_1.2.0.1             
##  [59] RColorBrewer_1.1-2            yaml_2.2.1                   
##  [61] curl_4.3                      memoise_2.0.0                
##  [63] gridExtra_2.3                 stringi_1.5.3                
##  [65] RSQLite_2.2.3                 highr_0.8                    
##  [67] BiocVersion_3.12.0            rlang_0.4.10                 
##  [69] pkgconfig_2.0.3               bitops_1.0-6                 
##  [71] evaluate_0.14                 lattice_0.20-41              
##  [73] purrr_0.3.4                   labeling_0.4.2               
##  [75] cowplot_1.1.1                 bit_4.0.4                    
##  [77] tidyselect_1.1.0              magrittr_2.0.1               
##  [79] bookdown_0.21                 R6_2.5.0                     
##  [81] generics_0.1.0                DBI_1.1.1                    
##  [83] pillar_1.4.7                  withr_2.4.1                  
##  [85] RCurl_1.98-1.2                tibble_3.0.6                 
##  [87] crayon_1.4.1                  BiocFileCache_1.14.0         
##  [89] rmarkdown_2.6                 viridis_0.5.1                
##  [91] locfit_1.5-9.4                grid_4.0.4                   
##  [93] blob_1.2.1                    digest_0.6.27                
##  [95] xtable_1.8-4                  tidyr_1.1.2                  
##  [97] httpuv_1.5.5                  munsell_0.5.0                
##  [99] beeswarm_0.2.3                viridisLite_0.3.0            
## [101] vipor_0.4.5