Last updated: 2026-02-16

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

The results in this page were generated with repository version b345a9c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/anndatar.Rmd) and HTML (docs/anndatar.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author     Date        Message
Rmd   b345a9c  Dave Tang  2026-02-16  Using the anndataR package

The anndataR package brings the AnnData data structure into R:

anndataR provides a native R implementation of the AnnData data model, enabling R users to read, write, manipulate, and convert .h5ad files without requiring Python dependencies for core operations.

The AnnData format is the standard container for single-cell genomics data in the Python/scverse ecosystem (used by scanpy, scvi-tools, etc.). anndataR bridges the gap between R and Python single-cell workflows by providing bidirectional conversion with both SingleCellExperiment and Seurat objects.

Installation

Install from Bioconductor.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("anndataR")

Install the optional dependencies needed to read and write .h5ad files and to convert to other formats.

BiocManager::install("rhdf5")
BiocManager::install("SingleCellExperiment")
install.packages("Seurat")
install.packages("SeuratObject")

Package

Load package.

packageVersion("anndataR")
[1] '1.0.1'
suppressPackageStartupMessages(library(anndataR))

The AnnData data model

AnnData stores single-cell data in a structured format with eleven slots:

Slot       Description
X          Primary expression matrix (observations x variables, i.e. cells x genes)
obs        Observation (cell) metadata as a data.frame
var        Variable (gene) metadata as a data.frame
obs_names  Character vector of cell identifiers
var_names  Character vector of gene identifiers
layers     Named list of alternative matrices (same dimensions as X)
obsm       Named list of multi-dimensional observation annotations (e.g. embeddings)
varm       Named list of multi-dimensional variable annotations (e.g. loadings)
obsp       Named list of pairwise observation matrices (e.g. cell-cell distances)
varp       Named list of pairwise variable matrices
uns        Arbitrary unstructured metadata

A key difference from R conventions: AnnData stores matrices as observations x variables (cells x genes), while SingleCellExperiment and Seurat store them as variables x observations (genes x cells). anndataR handles this transposition automatically during conversions.
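
The orientation difference can be illustrated with base R alone. The sketch below uses a made-up 3 x 2 toy matrix (not part of this analysis) and only shows what a transposition does to dimensions and dimnames; it does not call anndataR, whose conversion functions take care of this step internally.

```r
# Toy genes x cells matrix, the orientation used by SingleCellExperiment and Seurat.
genes_by_cells <- matrix(
  1:6,
  nrow = 3,
  dimnames = list(paste0("gene_", 1:3), paste0("cell_", 1:2))
)

# AnnData expects cells x genes, so the matrix is transposed.
cells_by_genes <- t(genes_by_cells)

dim(genes_by_cells)   # genes first, then cells
dim(cells_by_genes)   # cells first, then genes

# The value for a given gene/cell pair is unchanged; only the index order flips.
genes_by_cells["gene_2", "cell_1"] == cells_by_genes["cell_1", "gene_2"]
```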

Creating an AnnData object

Use the AnnData() constructor to create an in-memory AnnData object.

n_obs <- 100
n_vars <- 50

set.seed(1984)
counts <- matrix(rpois(n_obs * n_vars, lambda = 5), nrow = n_obs)
rownames(counts) <- paste0("cell_", seq_len(n_obs))
colnames(counts) <- paste0("gene_", seq_len(n_vars))

adata <- AnnData(
  X = counts,
  obs = data.frame(
    row.names = paste0("cell_", seq_len(n_obs)),
    cell_type = factor(sample(c("T cell", "B cell", "Monocyte"), n_obs, replace = TRUE)),
    total_counts = rowSums(counts),
    n_genes = rowSums(counts > 0)
  ),
  var = data.frame(
    row.names = paste0("gene_", seq_len(n_vars)),
    gene_name = paste0("Gene", seq_len(n_vars)),
    highly_variable = sample(c(TRUE, FALSE), n_vars, replace = TRUE)
  )
)

adata
InMemoryAnnData object with n_obs × n_vars = 100 × 50
    obs: 'cell_type', 'total_counts', 'n_genes'
    var: 'gene_name', 'highly_variable'

Exploring the object

Access the dimensions. n_obs and n_vars are methods, so they must be called with parentheses.

dim(adata)
[1] 100  50
adata$n_obs()
[1] 100
adata$n_vars()
[1] 50

Observation names and variable names.

head(adata$obs_names)
[1] "cell_1" "cell_2" "cell_3" "cell_4" "cell_5" "cell_6"
head(adata$var_names)
[1] "gene_1" "gene_2" "gene_3" "gene_4" "gene_5" "gene_6"

Cell metadata stored in obs.

head(adata$obs)
       cell_type total_counts n_genes
cell_1    B cell          259      50
cell_2  Monocyte          250      49
cell_3    T cell          256      50
cell_4    B cell          229      50
cell_5    B cell          259      50
cell_6    B cell          253      50
table(adata$obs$cell_type)

  B cell Monocyte   T cell 
      39       34       27 

Gene metadata stored in var.

head(adata$var)
       gene_name highly_variable
gene_1     Gene1           FALSE
gene_2     Gene2            TRUE
gene_3     Gene3            TRUE
gene_4     Gene4           FALSE
gene_5     Gene5            TRUE
gene_6     Gene6            TRUE
sum(adata$var$highly_variable)
[1] 27

The expression matrix stored in X has cells as rows and genes as columns.

dim(adata$X)
[1] 100  50
adata$X[1:5, 1:5]
       gene_1 gene_2 gene_3 gene_4 gene_5
cell_1      6      8      5      4      4
cell_2      4      6      7      6      7
cell_3      4      3      3      7      5
cell_4      4      4      3      5      3
cell_5      6      3      6      7      3

Adding layers

Layers store alternative representations of the data with the same dimensions as X. A common use case is storing raw counts alongside normalised values.

log_norm <- log1p(sweep(counts, 1, rowSums(counts), "/") * 1e4)

adata$layers <- list(
  log_norm = log_norm
)

adata$layers_keys()
[1] "log_norm"
adata$layers[["log_norm"]][1:5, 1:5]
         gene_1   gene_2   gene_3   gene_4   gene_5
cell_1 5.449579 5.736186 5.268117 5.046261 5.046261
cell_2 5.081404 5.484797 5.638355 5.484797 5.638355
cell_3 5.057837 4.772272 4.772272 5.614724 5.279708
cell_4 5.168621 5.168621 4.882835 5.390626 4.882835
cell_5 5.449579 4.760721 5.449579 5.603116 4.760721
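
The normalisation stored in the log_norm layer scales each cell to 10,000 total counts and then applies log(1 + x). As a sanity check, the base R sketch below recreates the simulated counts with the same seed and dimensions and confirms that undoing the log gives rows summing to 10,000. It runs on its own and does not need anndataR.

```r
# Recreate the simulated counts with the same seed and dimensions as above.
set.seed(1984)
counts <- matrix(rpois(100 * 50, lambda = 5), nrow = 100)

# Scale each cell (row) to 10,000 total counts, then apply log(1 + x).
size_factors <- rowSums(counts)
log_norm <- log1p(sweep(counts, 1, size_factors, "/") * 1e4)

# Undoing the log with expm1() should give rows that sum to 1e4.
range(rowSums(expm1(log_norm)))

# Worked value for a single entry: a count of 6 in a cell with 259 total counts
# (compare with the cell_1 / gene_1 entry printed above).
log1p(6 / 259 * 1e4)
```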

Adding embeddings

Dimensionality reductions are stored in obsm (one entry per cell). Let’s compute a simple PCA and store it.

pca_result <- prcomp(log_norm, center = TRUE, scale. = TRUE, rank. = 20)

adata$obsm <- list(
  X_pca = pca_result$x
)

adata$obsm_keys()
[1] "X_pca"
dim(adata$obsm[["X_pca"]])
[1] 100  20

Visualise the first two principal components.

pca_df <- data.frame(
  PC1 = pca_result$x[, 1],
  PC2 = pca_result$x[, 2],
  cell_type = adata$obs$cell_type
)

plot(
  pca_df$PC1, pca_df$PC2,
  col = as.integer(pca_df$cell_type),
  pch = 16,
  xlab = "PC1",
  ylab = "PC2",
  main = "PCA of simulated data"
)
legend("topright", levels(pca_df$cell_type), col = seq_along(levels(pca_df$cell_type)), pch = 16)

Since the data is randomly generated, we do not expect the cell types to separate.
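
One way to see the lack of structure is to look at how much variance each component explains. The sketch below computes the proportion of variance by hand from a prcomp() fit and compares it with the rounded values reported by summary(). It uses random stand-in data generated with rnorm() rather than the log_norm matrix, so the numbers will differ from this analysis.

```r
set.seed(1984)
mat <- matrix(rnorm(100 * 50), nrow = 100)   # stand-in data, not the counts above
pca <- prcomp(mat, center = TRUE, scale. = TRUE, rank. = 20)

# Proportion of variance per component: squared sdev over the total variance.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# summary() reports the same quantity, rounded for display.
reported <- summary(pca)$importance[2, 1:5]

prop_var[1:5]
unname(reported)

# The largest difference should be small, since only rounding differs.
max(abs(prop_var[1:5] - unname(reported)))
```

With unstructured data, the leading components each explain only a small share of the variance, which is consistent with the absence of visible clusters.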

Adding variable loadings

Gene loadings from PCA can be stored in varm.

adata$varm <- list(
  PCs = pca_result$rotation
)

adata$varm_keys()
[1] "PCs"
dim(adata$varm[["PCs"]])
[1] 50 20
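
The scores kept in obsm and the loadings kept in varm are linked: projecting the centred and scaled data onto the loadings gives back the scores. The base R sketch below checks this on random stand-in data (an assumption made for illustration; it is not the log_norm matrix from this analysis).

```r
set.seed(1984)
mat <- matrix(rnorm(100 * 50), nrow = 100)   # stand-in data, cells x genes

pca <- prcomp(mat, center = TRUE, scale. = TRUE, rank. = 20)

# Scores (cells x PCs) are the centred, scaled data projected onto the loadings.
scores <- scale(mat, center = TRUE, scale = TRUE) %*% pca$rotation

dim(pca$rotation)   # genes x PCs, the shape stored in varm
dim(scores)         # cells x PCs, the shape stored in obsm
all.equal(unname(scores), unname(pca$x))
```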

Unstructured metadata

The uns slot stores arbitrary metadata such as colour palettes, analysis parameters, or summary statistics.

adata$uns <- list(
  analysis_date = Sys.Date(),
  cell_type_colours = c("T cell" = "steelblue", "B cell" = "tomato", "Monocyte" = "forestgreen"),
  pca_variance = summary(pca_result)$importance[2, 1:5]
)

adata$uns_keys()
[1] "analysis_date"     "cell_type_colours" "pca_variance"
adata$uns[["cell_type_colours"]]
       T cell        B cell      Monocyte 
  "steelblue"      "tomato" "forestgreen" 

Subsetting

AnnData objects support subsetting with [ using logical, numeric, or character indices. Subsetting creates a view that references the parent object without copying data.

t_cells <- adata[adata$obs$cell_type == "T cell", ]
t_cells
View of InMemoryAnnData object with n_obs × n_vars = 27 × 50
    obs: 'cell_type', 'total_counts', 'n_genes'
    var: 'gene_name', 'highly_variable'
    uns: 'analysis_date', 'cell_type_colours', 'pca_variance'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'log_norm'
small <- adata[1:10, 1:5]
dim(small)
[1] 10  5
small$X
        gene_1 gene_2 gene_3 gene_4 gene_5
cell_1       6      8      5      4      4
cell_2       4      6      7      6      7
cell_3       4      3      3      7      5
cell_4       4      4      3      5      3
cell_5       6      3      6      7      3
cell_6       7      3      4     11      3
cell_7       1      2      3      7      3
cell_8       5      5      9      5      6
cell_9       7      4     11      3      4
cell_10      3     10      6      3      8
selected <- adata[c("cell_1", "cell_2"), c("gene_1", "gene_10", "gene_50")]
dim(selected)
[1] 2 3
selected$obs
       cell_type total_counts n_genes
cell_1    B cell          259      50
cell_2  Monocyte          250      49

Subset to highly variable genes.

hv_genes <- adata$var_names[adata$var$highly_variable]
adata_hv <- adata[, hv_genes]
dim(adata_hv)
[1] 100  27
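
Conditions on obs can also be combined into a single logical mask before subsetting. The sketch below builds a small deterministic toy object with the same AnnData() constructor used earlier and filters on two obs columns at once. The object, the keep column, and the threshold are made up for illustration.

```r
library(anndataR)

# Small deterministic stand-in object (not the simulated data above).
X <- matrix(seq_len(80), nrow = 20,
            dimnames = list(paste0("cell_", 1:20), paste0("gene_", 1:4)))
toy <- AnnData(
  X = X,
  obs = data.frame(
    row.names = rownames(X),
    cell_type = rep(c("T cell", "B cell"), times = 10),
    total_counts = rowSums(X)
  ),
  var = data.frame(row.names = colnames(X), keep = c(TRUE, FALSE, TRUE, TRUE))
)

# Combine two conditions on obs into one logical mask for the rows.
cell_mask <- toy$obs$cell_type == "T cell" &
  toy$obs$total_counts > median(toy$obs$total_counts)

# Select genes by name, as in the highly variable gene example.
kept_genes <- toy$var_names[toy$var$keep]

subset_toy <- toy[cell_mask, kept_genes]
dim(subset_toy)
sum(cell_mask)   # number of cells expected in the subset
```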

Reading h5ad files

anndataR reads and writes .h5ad files natively through Bioconductor’s {rhdf5} package, so no Python installation is required.

pbmc3k <- read_h5ad("data/pbmc3k.h5ad")
pbmc3k
InMemoryAnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'
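
Writing works in the opposite direction. The sketch below round-trips a tiny toy object through a temporary file, assuming that write_h5ad() accepts the object followed by a file path and that rhdf5 is installed. The toy data and the temporary path are made up for illustration and are not files from this project.

```r
library(anndataR)

# Toy object; 'path' is a throwaway temporary file.
X <- matrix(as.numeric(1:12), nrow = 4,
            dimnames = list(paste0("cell_", 1:4), paste0("gene_", 1:3)))
toy <- AnnData(
  X = X,
  obs = data.frame(row.names = rownames(X), batch = c("a", "a", "b", "b")),
  var = data.frame(row.names = colnames(X), gene_name = c("G1", "G2", "G3"))
)

path <- tempfile(fileext = ".h5ad")
write_h5ad(toy, path)          # assumed signature: object, then path

# Read the file back and confirm the shape and cell names survived.
roundtrip <- read_h5ad(path)
dim(roundtrip)
roundtrip$obs_names

unlink(path)                   # clean up the temporary file
```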

Converting to SingleCellExperiment

anndataR provides direct conversion to SingleCellExperiment objects, which are widely used in Bioconductor single-cell workflows.

suppressPackageStartupMessages(library(SingleCellExperiment))

sce <- pbmc3k$as_SingleCellExperiment()
sce
class: SingleCellExperiment 
dim: 32738 2700 
metadata(0):
assays(1): X
rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
rowData names(1): gene_ids
colnames(2700): AAACATACAACCAC-1 AAACATTGAGCTAC-1 ... TTTGCATGAGAGGC-1
  TTTGCATGCCTCAC-1
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Note that the matrix is transposed: SingleCellExperiment stores genes as rows and cells as columns.

dim(sce)
[1] 32738  2700
assayNames(sce)
[1] "X"
head(colData(sce))
DataFrame with 6 rows and 0 columns

Converting to Seurat object

suppressPackageStartupMessages(library(Seurat))

seurat_obj <- pbmc3k$as_Seurat()
Warning: No "counts" or "data" layer found in `names(layers_mapping)`, this may lead to
unexpected results when using the resulting <Seurat> object.
Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
seurat_obj
An object of class Seurat 
32738 features across 2700 samples within 1 assay 
Active assay: RNA (32738 features, 0 variable features)
 1 layer present: X
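
The second warning means that any feature name containing an underscore is changed during conversion. The base R sketch below mimics that renaming on a few hypothetical feature names to show how to list the affected names and spot collisions. It is an illustration of the rule stated in the warning, not Seurat's own code.

```r
# Hypothetical feature names; two of them differ only by '_' versus '-'.
features <- c("gene_1", "gene-1", "MT_ND1", "CD3E")

# Mimic the renaming described in the warning: underscores become dashes.
renamed <- gsub("_", "-", features, fixed = TRUE)

# Names that were changed by the renaming.
features[features != renamed]

# Names that collide after renaming and would need manual attention.
renamed[duplicated(renamed)]
```

Running the same check on the var_names of an object before conversion makes it easier to map features back to their original identifiers afterwards.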

Summary

anndataR provides a native R implementation of the AnnData data model that:

  • Reads and writes .h5ad files without Python via {rhdf5}
  • Converts bidirectionally with SingleCellExperiment and Seurat
  • Supports in-memory, HDF5-backed, and reticulate-based backends
  • Creates memory-efficient views when subsetting
  • Replaces older packages (anndata, zellkonverter, h5ad) with a single unified solution

This makes it straightforward to share single-cell datasets between R and Python workflows.

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] Seurat_5.4.0                SeuratObject_5.3.0         
 [3] sp_2.2-1                    SingleCellExperiment_1.32.0
 [5] SummarizedExperiment_1.40.0 Biobase_2.70.0             
 [7] GenomicRanges_1.62.1        Seqinfo_1.0.0              
 [9] IRanges_2.44.0              S4Vectors_0.48.0           
[11] BiocGenerics_0.56.0         generics_0.1.4             
[13] MatrixGenerics_1.22.0       matrixStats_1.5.0          
[15] anndataR_1.0.1              workflowr_1.7.2            

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     rstudioapi_0.18.0      jsonlite_2.0.0        
  [4] magrittr_2.0.4         spatstat.utils_3.2-1   farver_2.1.2          
  [7] rmarkdown_2.30         fs_1.6.6               vctrs_0.7.1           
 [10] ROCR_1.0-12            spatstat.explore_3.7-0 htmltools_0.5.9       
 [13] S4Arrays_1.10.1        Rhdf5lib_1.32.0        SparseArray_1.10.8    
 [16] rhdf5_2.54.1           sass_0.4.10            sctransform_0.4.3     
 [19] parallelly_1.46.1      KernSmooth_2.23-26     bslib_0.10.0          
 [22] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9            
 [25] plotly_4.12.0          zoo_1.8-15             cachem_1.1.0          
 [28] whisker_0.4.1          igraph_2.2.2           mime_0.13             
 [31] lifecycle_1.0.5        pkgconfig_2.0.3        Matrix_1.7-4          
 [34] R6_2.6.1               fastmap_1.2.0          fitdistrplus_1.2-6    
 [37] future_1.69.0          shiny_1.12.1           digest_0.6.39         
 [40] patchwork_1.3.2        ps_1.9.1               tensor_1.5.1          
 [43] rprojroot_2.1.1        RSpectra_0.16-2        irlba_2.3.7           
 [46] progressr_0.18.0       spatstat.sparse_3.1-0  polyclip_1.10-7       
 [49] httr_1.4.8             abind_1.4-8            compiler_4.5.2        
 [52] S7_0.2.1               fastDummies_1.7.5      MASS_7.3-65           
 [55] DelayedArray_0.36.0    tools_4.5.2            lmtest_0.9-40         
 [58] otel_0.2.0             httpuv_1.6.16          future.apply_1.20.1   
 [61] goftest_1.2-3          glue_1.8.0             callr_3.7.6           
 [64] nlme_3.1-168           rhdf5filters_1.22.0    promises_1.5.0        
 [67] grid_4.5.2             Rtsne_0.17             getPass_0.2-4         
 [70] cluster_2.1.8.1        reshape2_1.4.5         spatstat.data_3.1-9   
 [73] gtable_0.3.6           tidyr_1.3.2            data.table_1.18.2.1   
 [76] XVector_0.50.0         spatstat.geom_3.7-0    RcppAnnoy_0.0.23      
 [79] ggrepel_0.9.6          RANN_2.6.2             pillar_1.11.1         
 [82] stringr_1.6.0          spam_2.11-3            RcppHNSW_0.6.0        
 [85] later_1.4.6            splines_4.5.2          dplyr_1.2.0           
 [88] lattice_0.22-7         deldir_2.0-4           survival_3.8-3        
 [91] tidyselect_1.2.1       miniUI_0.1.2           pbapply_1.7-4         
 [94] knitr_1.51             git2r_0.36.2           gridExtra_2.3         
 [97] scattermore_1.2        xfun_0.56              stringi_1.8.7         
[100] lazyeval_0.2.2         yaml_2.3.12            evaluate_1.0.5        
[103] codetools_0.2-20       tibble_3.3.1           cli_3.6.5             
[106] uwot_0.2.4             xtable_1.8-4           reticulate_1.45.0     
[109] processx_3.8.6         jquerylib_0.1.4        Rcpp_1.1.1            
[112] spatstat.random_3.4-4  globals_0.19.0         png_0.1-8             
[115] spatstat.univar_3.1-6  parallel_4.5.2         ggplot2_4.0.2         
[118] dotCall64_1.2          listenv_0.10.0         viridisLite_0.4.3     
[121] scales_1.4.0           ggridges_0.5.7         purrr_1.2.1           
[124] rlang_1.1.7            cowplot_1.2.0