Last updated: 2026-01-16

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 8980224. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/1M_neurons_filtered_gene_bc_matrices_h5.h5
    Ignored:    data/293t/
    Ignored:    data/293t_3t3_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/293t_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/5k_Human_Donor1_PBMC_3p_gem-x_5k_Human_Donor1_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor4_PBMC_3p_gem-x_5k_Human_Donor4_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/97516b79-8d08-46a6-b329-5d0a25b0be98.h5ad
    Ignored:    data/Parent_SC3v3_Human_Glioblastoma_filtered_feature_bc_matrix.tar.gz
    Ignored:    data/brain_counts/
    Ignored:    data/cl.obo
    Ignored:    data/cl.owl
    Ignored:    data/jurkat/
    Ignored:    data/jurkat:293t_50:50_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/jurkat_293t/
    Ignored:    data/jurkat_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc20k/
    Ignored:    data/pbmc20k_seurat/
    Ignored:    data/pbmc3k.csv
    Ignored:    data/pbmc3k.csv.gz
    Ignored:    data/pbmc3k.h5ad
    Ignored:    data/pbmc3k/
    Ignored:    data/pbmc3k_bpcells_mat/
    Ignored:    data/pbmc3k_export.mtx
    Ignored:    data/pbmc3k_matrix.mtx
    Ignored:    data/pbmc3k_seurat.rds
    Ignored:    data/pbmc4k_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc_1k_v3_filtered_feature_bc_matrix.h5
    Ignored:    data/pbmc_1k_v3_raw_feature_bc_matrix.h5
    Ignored:    data/refdata-gex-GRCh38-2020-A.tar.gz
    Ignored:    data/seurat_1m_neuron.rds
    Ignored:    data/t_3k_filtered_gene_bc_matrices.tar.gz
    Ignored:    r_packages_4.4.1/
    Ignored:    r_packages_4.5.0/

Untracked files:
    Untracked:  analysis/bioc.Rmd
    Untracked:  analysis/bioc_scrnaseq.Rmd
    Untracked:  analysis/likelihood.Rmd
    Untracked:  bpcells_matrix/
    Untracked:  data/Caenorhabditis_elegans.WBcel235.113.gtf.gz
    Untracked:  data/GCF_043380555.1-RS_2024_12_gene_ontology.gaf.gz
    Untracked:  data/arab.rds
    Untracked:  data/astronomicalunit.csv
    Untracked:  data/femaleMiceWeights.csv
    Untracked:  data/lung_bcell.rds
    Untracked:  m3/
    Untracked:  women.json

Unstaged changes:
    Modified:   analysis/isoform_switch_analyzer.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/diversity.Rmd) and HTML (docs/diversity.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 8980224 Dave Tang 2026-01-16 Diversity metrics

Introduction

This notebook demonstrates various diversity metrics. Install packages, if necessary.

install.packages("vegan")

Data

Creating a simulated cell type by organoid count matrix representing scRNA-seq data from organoids generated using 5 different methods.

set.seed(1984)

cell_types <- c(
  "Neural Progenitors",
  "Neurons",
  "Astrocytes", 
  "Oligodendrocytes",
  "Microglia",
  "Endothelial",
  "Radial Glia",
  "Intermediate Progenitors"
)

count_matrix <- base::matrix(c(
  450, 480, 420, 380, 350, 340, 400, 380,
  200, 800, 150, 120, 80, 100, 250, 200,
  100, 1200, 50, 30, 20, 40, 80, 60,
  500, 300, 250, 200, 150, 100, 80, 50,
  300, 500, 200, 180, 150, 120, 200, 150
), nrow = 8, ncol = 5, byrow = FALSE)
count_matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]  450  200  100  500  300
[2,]  480  800 1200  300  500
[3,]  420  150   50  250  200
[4,]  380  120   30  200  180
[5,]  350   80   20  150  150
[6,]  340  100   40  100  120
[7,]  400  250   80   80  200
[8,]  380  200   60   50  150
base::rownames(count_matrix) <- cell_types
base::colnames(count_matrix) <- letters[1:ncol(count_matrix)]

count_matrix
                           a   b    c   d   e
Neural Progenitors       450 200  100 500 300
Neurons                  480 800 1200 300 500
Astrocytes               420 150   50 250 200
Oligodendrocytes         380 120   30 200 180
Microglia                350  80   20 150 150
Endothelial              340 100   40 100 120
Radial Glia              400 250   80  80 200
Intermediate Progenitors 380 200   60  50 150

Composition

Visualise the composition.

prop_matrix <- apply(count_matrix, 2, \(x) x/sum(x))

prop_df <- base::as.data.frame(prop_matrix) |>
  tibble::rownames_to_column("cell_type") |>
  tidyr::pivot_longer(-cell_type, names_to = "method", values_to = "proportion")

ggplot(prop_df, aes(x = method, y = proportion, fill = cell_type)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Cell type composition by method",
    y = "Proportion",
    x = "Method"
  ) +
  scale_fill_brewer(palette = "Set3")

Diversity Metrics

Species Richness (S)

What it measures: Simply the number of different cell types present (with >0 cells).

Interpretation: Higher values indicate more cell types are present.

Key characteristic: Does NOT account for abundance - treats rare and common types equally.

richness <- vegan::specnumber(base::t(count_matrix))

richness_df <- base::data.frame(
  Method = base::names(richness),
  Richness = richness
)

richness_df
  Method Richness
a      a        8
b      b        8
c      c        8
d      d        8
e      e        8

Shannon Diversity Index

What it measures: Both richness and evenness combined. Quantifies uncertainty in predicting the cell type of a randomly selected cell.

Interpretation: - Higher values indicate greater diversity - Ranges from 0 (one cell type only) to ln(S) (all types equally abundant) - Values typically between 1.5-3.5 for biological communities

Key characteristic: More sensitive to rare types than Simpson’s index.

shannon <- vegan::diversity(base::t(count_matrix), index = "shannon")

shannon_df <- base::data.frame(
  Method = base::names(shannon),
  Shannon = base::round(shannon, 3)
)

shannon_df
  Method Shannon
a      a   2.073
b      b   1.768
c      c   0.992
d      d   1.865
e      e   1.968

Simpson Diversity Index

What it measures: The probability that two randomly selected cells are of different types.

Interpretation: - Ranges from 0 to 1 - Higher values indicate greater diversity - Less sensitive to rare species than Shannon

Key characteristic: Emphasizes dominant cell types more than Shannon.

simpson <- vegan::diversity(base::t(count_matrix), index = "simpson")

simpson_df <- base::data.frame(
  Method = base::names(simpson),
  Simpson = base::round(simpson, 3)
)

simpson_df
  Method Simpson
a      a   0.873
b      b   0.768
c      c   0.413
d      d   0.818
e      e   0.842

Inverse Simpson

What it measures: The effective number of cell types. Represents the number of equally abundant types needed to produce the observed diversity.

Interpretation: - Minimum value of 1 (only one type) - Maximum value equals richness (S) when all types equally abundant - More intuitive than Simpson’s D

Key characteristic: Can be directly compared to species richness.

inv_simpson <- vegan::diversity(base::t(count_matrix), index = "invsimpson")

inv_simpson_df <- base::data.frame(
  Method = base::names(inv_simpson),
  InvSimpson = base::round(inv_simpson, 2)
)

inv_simpson_df
  Method InvSimpson
a      a       7.90
b      b       4.32
c      c       1.70
d      d       5.49
e      e       6.33

Pielou’s Evenness Index

What it measures: How evenly cells are distributed among different types, independent of richness.

Interpretation: - Ranges from 0 to 1 - 1 indicates perfect evenness (all types equally abundant) - 0 indicates complete unevenness (one type dominates)

Key characteristic: Separates evenness from richness, useful for comparing samples with different numbers of cell types.

pielou <- shannon / base::log(richness)

pielou_df <- base::data.frame(
  Method = base::names(pielou),
  Pielou = base::round(pielou, 3)
)

pielou_df
  Method Pielou
a      a  0.997
b      b  0.850
c      c  0.477
d      d  0.897
e      e  0.946

Berger-Parker Dominance Index

What it measures: The proportion of the most abundant cell type.

Interpretation: - Ranges from 0 to 1 - Higher values indicate one type strongly dominates - Inverse of diversity (high dominance = low diversity)

Key characteristic: Simple, intuitive measure focusing only on the most abundant type.

berger_parker <- base::apply(count_matrix, 2, \(x) base::max(x) / base::sum(x))

bp_df <- base::data.frame(
  Method = base::names(berger_parker),
  BergerParker = base::round(berger_parker, 3)
)

bp_df
  Method BergerParker
a      a        0.150
b      b        0.421
c      c        0.759
d      d        0.307
e      e        0.278

Comparison

Summary.

summary_df <- base::data.frame(
  Richness = richness,
  Shannon = base::round(shannon, 3),
  Simpson = base::round(simpson, 3),
  InvSimpson = base::round(inv_simpson, 2),
  Pielou = base::round(pielou, 3),
  BergerParker = base::round(berger_parker, 3)
)

summary_df
  Richness Shannon Simpson InvSimpson Pielou BergerParker
a        8   2.073   0.873       7.90  0.997        0.150
b        8   1.768   0.768       4.32  0.850        0.421
c        8   0.992   0.413       1.70  0.477        0.759
d        8   1.865   0.818       5.49  0.897        0.307
e        8   1.968   0.842       6.33  0.946        0.278

Key Differences Between Metrics

  • Richness vs. Diversity Indices
    • Richness only counts types (presence/absence)
    • Shannon, Simpson, Inverse Simpson combine richness and evenness
  • Shannon vs. Simpson
    • Shannon is more sensitive to rare cell types
    • Simpson emphasizes dominant cell types
    • Shannon ranges 0 to ln(S); Simpson ranges 0 to 1
  • Simpson vs. Inverse Simpson
    • Same information, different scale
    • Inverse Simpson is more intuitive (effective number of types)
  • Evenness vs. Diversity
    • Pielou’s J’ isolates evenness from richness
    • Allows comparing organoids with different numbers of cell types
  • Dominance vs. Diversity
    • Berger-Parker focuses only on the most abundant type
    • Opposite of diversity (high dominance = low diversity)

Conclusion

  1. For overall diversity: Use Shannon or Inverse Simpson
  2. For evenness specifically: Use Pielou’s J’
  3. To identify dominant types: Use Berger-Parker
  4. For simple comparison: Use Richness and Inverse Simpson together

sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] vegan_2.7-1     permute_0.9-8   lubridate_1.9.4 forcats_1.0.0  
 [5] stringr_1.5.1   dplyr_1.1.4     purrr_1.0.4     readr_2.1.5    
 [9] tidyr_1.3.1     tibble_3.3.0    ggplot2_3.5.2   tidyverse_2.0.0
[13] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] sass_0.4.10        generics_0.1.4     lattice_0.22-6     stringi_1.8.7     
 [5] hms_1.1.3          digest_0.6.37      magrittr_2.0.3     timechange_0.3.0  
 [9] evaluate_1.0.3     grid_4.5.0         RColorBrewer_1.1-3 fastmap_1.2.0     
[13] Matrix_1.7-3       rprojroot_2.0.4    jsonlite_2.0.0     processx_3.8.6    
[17] whisker_0.4.1      ps_1.9.1           promises_1.3.3     mgcv_1.9-1        
[21] httr_1.4.7         scales_1.4.0       jquerylib_0.1.4    cli_3.6.5         
[25] rlang_1.1.6        splines_4.5.0      withr_3.0.2        cachem_1.1.0      
[29] yaml_2.3.10        parallel_4.5.0     tools_4.5.0        tzdb_0.5.0        
[33] httpuv_1.6.16      vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4   
[37] git2r_0.36.2       fs_1.6.6           MASS_7.3-65        cluster_2.1.8.1   
[41] pkgconfig_2.0.3    callr_3.7.6        pillar_1.10.2      bslib_0.9.0       
[45] later_1.4.2        gtable_0.3.6       glue_1.8.0         Rcpp_1.0.14       
[49] xfun_0.52          tidyselect_1.2.1   rstudioapi_0.17.1  knitr_1.50        
[53] farver_2.1.2       nlme_3.1-168       htmltools_0.5.8.1  labeling_0.4.3    
[57] rmarkdown_2.29     compiler_4.5.0     getPass_0.2-4