Last updated: 2026-04-27
Checks: 7 passed, 0 failed
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 8709228. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rproj.user/
Ignored: data/1M_neurons_filtered_gene_bc_matrices_h5.h5
Ignored: data/293t/
Ignored: data/293t_3t3_filtered_gene_bc_matrices.tar.gz
Ignored: data/293t_filtered_gene_bc_matrices.tar.gz
Ignored: data/5k_Human_Donor1_PBMC_3p_gem-x_5k_Human_Donor1_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/5k_Human_Donor4_PBMC_3p_gem-x_5k_Human_Donor4_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/97516b79-8d08-46a6-b329-5d0a25b0be98.h5ad
Ignored: data/Parent_SC3v3_Human_Glioblastoma_filtered_feature_bc_matrix.tar.gz
Ignored: data/brain_counts/
Ignored: data/cl.obo
Ignored: data/cl.owl
Ignored: data/jurkat/
Ignored: data/jurkat:293t_50:50_filtered_gene_bc_matrices.tar.gz
Ignored: data/jurkat_293t/
Ignored: data/jurkat_filtered_gene_bc_matrices.tar.gz
Ignored: data/pbmc20k/
Ignored: data/pbmc20k_seurat/
Ignored: data/pbmc3k.csv
Ignored: data/pbmc3k.csv.gz
Ignored: data/pbmc3k.h5ad
Ignored: data/pbmc3k/
Ignored: data/pbmc3k_bpcells_mat/
Ignored: data/pbmc3k_export.mtx
Ignored: data/pbmc3k_matrix.mtx
Ignored: data/pbmc3k_seurat.rds
Ignored: data/pbmc4k_filtered_gene_bc_matrices.tar.gz
Ignored: data/pbmc_1k_v3_filtered_feature_bc_matrix.h5
Ignored: data/pbmc_1k_v3_raw_feature_bc_matrix.h5
Ignored: data/refdata-gex-GRCh38-2020-A.tar.gz
Ignored: data/seurat_1m_neuron.rds
Ignored: data/t_3k_filtered_gene_bc_matrices.tar.gz
Ignored: r_packages_4.5.2/
Untracked files:
Untracked: .claude/
Untracked: CLAUDE.md
Untracked: analysis/.claude/
Untracked: analysis/aucc.Rmd
Untracked: analysis/bimodal.Rmd
Untracked: analysis/bioc.Rmd
Untracked: analysis/bioc_scrnaseq.Rmd
Untracked: analysis/chick_weight.Rmd
Untracked: analysis/likelihood.Rmd
Untracked: analysis/modelling.Rmd
Untracked: analysis/sampleqc.Rmd
Untracked: analysis/wordpress_readability.Rmd
Untracked: bpcells_matrix/
Untracked: data/Caenorhabditis_elegans.WBcel235.113.gtf.gz
Untracked: data/GCF_043380555.1-RS_2024_12_gene_ontology.gaf.gz
Untracked: data/SeuratObj.rds
Untracked: data/arab.rds
Untracked: data/astronomicalunit.csv
Untracked: data/davetang039sblog.WordPress.2026-02-12.xml
Untracked: data/femaleMiceWeights.csv
Untracked: data/lung_bcell.rds
Untracked: m3/
Untracked: women.json
Unstaged changes:
Modified: analysis/isoform_switch_analyzer.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/parallel.Rmd) and HTML
(docs/parallel.html) files. If you’ve configured a remote
Git repository (see ?wflow_git_remote), click on the
hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 8709228 | Dave Tang | 2026-04-27 | Add background information |
| html | 3253d99 | Dave Tang | 2026-04-27 | Build site. |
| Rmd | 81dfe8d | Dave Tang | 2026-04-27 | Use params to set number of threads |
| html | 131a202 | Dave Tang | 2024-12-24 | Build site. |
| Rmd | 5871c7c | Dave Tang | 2024-12-24 | Using future_lapply() |
| html | ebb6bb6 | Dave Tang | 2023-12-20 | Build site. |
| Rmd | 7df7cc7 | Dave Tang | 2023-12-20 | MulticoreParam |
| html | c9ebb81 | Dave Tang | 2023-12-20 | Build site. |
| Rmd | 16e8bbf | Dave Tang | 2023-12-20 | Forking is faster than using sockets |
| html | b874727 | Dave Tang | 2023-12-20 | Build site. |
| Rmd | 49be9e8 | Dave Tang | 2023-12-20 | Update |
| html | 0b6b70f | Dave Tang | 2023-07-27 | Build site. |
| Rmd | e6c246e | Dave Tang | 2023-07-27 | Worker environment |
| html | 2f4cb47 | Dave Tang | 2023-07-26 | Build site. |
| Rmd | 9367b80 | Dave Tang | 2023-07-26 | pbapply |
| html | 130d11f | Dave Tang | 2022-11-17 | Build site. |
| Rmd | b2043f3 | Dave Tang | 2022-11-17 | Parallel computation in R |
As stated in the foreach vignette:
Much of parallel computing comes to doing three things: splitting the problem into pieces, executing the pieces in parallel, and combining the results back together.
There are several packages that make it easy to run tasks in parallel, including the foreach package and doParallel, which acts as an interface between foreach and the parallel package.

R’s parallel computing ecosystem has grown over more than two decades, and the variety reflects a mix of historical accident, ecosystem-specific needs, and evolving design ideas:

- The parallel package (introduced in R 2.14.0 in 2011) was created by merging two earlier packages: snow (Simple Network of Workstations, 1999) for socket-based clusters, and multicore (2009) for forked workers on Unix. Most other packages either build on parallel or replace it.
- Users prefer different programming styles: some like loop syntax (foreach), some prefer apply-style functions (*apply, bplapply, future_lapply), and tidyverse users prefer functional mapping (map, future_map). Each style has at least one parallelisation package tailored to it.
- Packages grew up in different ecosystems: BiocParallel was written for Bioconductor, where workflows commonly run on HPC clusters with schedulers like SLURM or SGE, and where robust error handling and logging matter. furrr was written for the tidyverse. doParallel was written by Revolution Analytics (now Microsoft) as a backend for foreach.
- The underlying mechanisms differ across platforms: some packages (future, BiocParallel) try to abstract this away; others expose it directly.
- Packages also differentiate themselves with extra features: progress bars (pbapply), automatic detection of variables that need to be exported (future), reproducible parallel RNG, structured error handling, and so on.

A useful mental model is that there are really only two parallelisation mechanisms in R — forking and socket clusters — and most of the packages above are different frontends on top of those mechanisms.
Almost every package in this notebook is ultimately doing one of these two things:

- Forking: worker processes are created with a fork() call. Workers share memory with the parent (copy-on-write), so objects in the parent environment are automatically available to every worker — no explicit export step is needed. Forking starts up quickly and avoids data transfer cost, but it is only available on Unix-like systems (Linux and macOS) and can be unsafe inside multi-threaded host processes such as the RStudio GUI.
- Socket clusters: independent R sessions are launched and communicate over sockets. Each worker starts with an empty environment, so data and packages must be sent to the workers explicitly (via clusterExport, clusterEvalQ, or — for future-based tools — automatic globals detection). Socket clusters are slower to start and have higher communication overhead, but they work on all platforms including Windows.

When you see mclapply, MulticoreParam, or plan(multicore), that is forking. When you see makeCluster, SnowParam, parLapply, or plan(multisession), that is a socket cluster.
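A minimal sketch of the two mechanisms side by side (assuming a Unix-like system for the forking half):

```r
library(parallel)

# Forking (Unix only): workers are forked copies of this session,
# so `x` is visible to them without any export step.
x <- 1:4
res_fork <- mclapply(x, sqrt, mc.cores = 2)

# Socket cluster (all platforms): workers are fresh R sessions,
# so data is shipped to them as arguments (or via clusterExport()).
cl <- makeCluster(2)
res_sock <- parLapply(cl, x, sqrt)
stopCluster(cl)

identical(res_fork, res_sock)
```

Both calls compute the same result; only the mechanism for getting the data to the workers differs.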
A second useful distinction is between frontends — the API you write your code against — and backends — what actually runs the work:

| Frontend | Typical backends |
|---|---|
| parallel::mclapply / parLapply | fork / socket (built in) |
| foreach::%dopar% | doParallel, doMC, doSNOW, doFuture, doMPI, … |
| BiocParallel::bplapply | MulticoreParam, SnowParam, BatchtoolsParam, … |
| future.apply::future_lapply, furrr::future_map | any future plan: sequential, multicore, multisession, cluster |
The same loop body can usually be moved between frontends with little change, but the backend you pick determines startup cost, memory behaviour, OS portability, and how variables are shared with the workers.
system.time

From ?proc.time:
The “user time” is the CPU time charged for the execution of user instructions of the calling process.
The “system time” is the CPU time charged for execution by the system on behalf of the calling process.
Elapsed time is the wall-clock time that has passed. The user and system time while sleeping are close to zero because the CPU is idle, not executing anything.
system.time(
Sys.sleep(5)
)
user system elapsed
0.000 0.000 5.005
More information is provided on Stack Overflow:
“User CPU time” gives the CPU time spent by the current process (i.e., the current R session and outside the kernel)
“System CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share.
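To see the contrast with Sys.sleep(), time a CPU-bound computation; the exact numbers will vary by machine, but user time should now account for most of the elapsed time:

```r
# Busy work: the CPU spends the whole time executing user
# instructions, so user time is close to elapsed time
# (unlike sleeping, where both are near zero).
system.time({
  x <- sum(sqrt(seq_len(5e7)))
})
```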
Create a list of 100 data frames each with 5,000 observations across 100 variables.
create_df <- function(n, m, seed = 1984){
set.seed(seed)
as.data.frame(
matrix(
data = rnorm(n = n * m),
nrow = n,
ncol = m
)
)
}
my_list <- lapply(1:100, function(x) create_df(5000, 100, x))
length(my_list)
[1] 100
This is a parameterised notebook; the number of threads used for the code examples is 4.
params$threads
[1] 4
parallel

The parallel package ships with base R and is the
foundation that almost every other package in this notebook builds on.
It exposes both parallelisation mechanisms directly: forking (via
mclapply, mcmapply, etc.) and socket clusters
(via makeCluster, parLapply,
parSapply, etc.). It is low-level and unopinionated — there
is no progress reporting, no automatic globals detection, and error
handling is bare-bones — but it has zero dependencies and is always
available.
Load the parallel package.
library(parallel)
Create a summary of each variable in each data frame without parallelisation.
system.time(
my_sum <- lapply(my_list, summary)
)
user system elapsed
3.394 0.008 3.402
The mclapply function can be used to process a list in
parallel. Note that this function uses forking, which is not available
on Windows.
system.time(
my_sum_mc <- mclapply(my_list, summary, mc.cores = params$threads)
)
user system elapsed
0.021 0.016 1.028
Compare the two summaries.
identical(my_sum, my_sum_mc)
[1] TRUE
Another way to run the jobs in parallel is via sockets. For Windows
users, you will need to use this method for parallelisation. In
addition, you need to use the parLapply function instead of
mclapply.
cl <- makeCluster(params$threads)
system.time(
my_sum_sock <- parLapply(cl, my_list, summary)
)
user system elapsed
0.373 0.073 1.997
stopCluster(cl)
identical(my_sum_mc, my_sum_sock)
[1] TRUE
Note that forking is faster.
If you run the code below:
cl <- makeCluster(params$threads)
system.time(
test <- parLapply(cl, seq_len(params$threads), function(x){
class(my_list)
})
)
stopCluster(cl)
you will get the following error:
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: object 'my_list' not found
This is because each worker is using a different environment. To make
the my_list object available to each worker, we use the
clusterExport() function.
cl <- makeCluster(params$threads)
clusterExport(cl, list("my_list"))
system.time(
test2 <- parSapply(cl, seq_len(params$threads), function(x){
class(my_list)
})
)
user system elapsed
0.001 0.000 0.001
stopCluster(cl)
test2
[1] "list" "list" "list" "list"
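clusterExport() sends objects to the workers; the related clusterEvalQ() evaluates an expression on every worker, which is the usual way to attach packages (or otherwise set up state) on a socket cluster. A small sketch:

```r
cl <- makeCluster(params$threads)

# Socket workers do not inherit the packages attached in the parent
# session, so attach what the tasks need on every worker.
clusterEvalQ(cl, library(stats))

# clusterEvalQ() returns one result per worker, so it is also handy
# for inspecting the workers, e.g. their process IDs.
clusterEvalQ(cl, Sys.getpid())

stopCluster(cl)
```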
pbapply

pbapply is primarily a progress-bar package: it provides
drop-in replacements for *apply (pblapply,
pbsapply, etc.) that show a progress bar in interactive
sessions. Parallel support is layered on top — internally it dispatches
to parLapply when given a cluster object and to
mclapply when given an integer worker count — so you get
progress reporting “for free” without changing parallelisation backend.
Reach for it when a long-running job benefits from a visible progress
indicator.
Parallelisation with a progress bar! From the help page of
pblapply:
Parallel processing can be enabled through the cl argument. parLapply is called when cl is a ‘cluster’ object, mclapply is called when cl is an integer. Showing the progress bar increases the communication overhead between the main process and nodes / child processes compared to the parallel equivalents of the functions without the progress bar. The functions fall back to their original equivalents when the progress bar is disabled (i.e. getOption(“pboptions”)$type == “none” or dopb() is FALSE). This is the default when interactive() is FALSE (i.e. called from a command line R script).
library(pbapply)
cl <- makeCluster(params$threads)
system.time(
my_sum_pb <- pblapply(my_list, summary, cl = cl)
)
user system elapsed
0.357 0.086 1.989
stopCluster(cl)
identical(my_sum_mc, my_sum_pb)
[1] TRUE
Use mclapply.
system.time(
my_sum_pb_fork <- pblapply(my_list, summary, cl = params$threads)
)
user system elapsed
0.953 0.092 1.024
identical(my_sum_pb, my_sum_pb_fork)
[1] TRUE
doParallel

doParallel is a backend for the foreach
frontend. foreach looks like a for loop but
returns a value (like lapply) and — crucially — is
parallel-backend agnostic: the same
foreach(...) %dopar% { ... } block can be run sequentially,
on a forked or socket cluster (via doParallel), on an MPI
cluster (doMPI), or via the future framework
(doFuture), depending on which backend you register. Reach
for foreach when your loop body is more complex than a
single function application (e.g. multiple result accumulators, custom
.combine strategies, or nested loops with
%:%).
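A small sketch of those two features, custom combination with .combine and nesting with %:% (using params$threads workers, as elsewhere in this notebook):

```r
library(doParallel)

cl <- makeCluster(params$threads)
registerDoParallel(cl)

# %:% nests the inner loop inside the outer one, and .combine = rbind
# stacks each iteration's data frame row-wise into a single result.
grid <- foreach(a = 1:2, .combine = rbind) %:%
  foreach(b = 1:3, .combine = rbind) %dopar% {
    data.frame(a = a, b = b, product = a * b)
  }

stopCluster(cl)
grid
```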
Load the doParallel package.
library(doParallel)
Loading required package: foreach
Loading required package: iterators
Using foreach.
cl <- makeCluster(params$threads)
registerDoParallel(cl)
system.time(
my_sum_dopar <- foreach(l = my_list) %dopar% {
summary(l)
}
)
user system elapsed
0.452 0.084 2.724
stopCluster(cl)
identical(my_sum_mc, my_sum_dopar)
[1] TRUE
BiocParallel

BiocParallel is Bioconductor’s unified parallelisation
interface, designed for the kinds of workloads common in genomics:
long-running jobs over large objects, where workers can fail mid-run and
where logs and checkpoints matter. The frontend is a small set of
bp* functions (bplapply,
bpmapply, bpiterate, …) that all take a
BPPARAM argument describing the backend:
- MulticoreParam — forking on Unix (falls back to SnowParam on Windows).
- SnowParam — socket cluster; can be type = "PSOCK" (default) or type = "FORK".
- SerialParam — no parallelism, useful for debugging.
- BatchtoolsParam — submit jobs to HPC schedulers (SLURM, SGE, LSF, Torque, …) via the batchtools package.

Reach for BiocParallel when you are working inside the Bioconductor ecosystem, when you need structured error handling and per-task logging, or when you want the same code to run on a laptop and a cluster by changing only the BPPARAM.
Load BiocParallel.
library(BiocParallel)
Using bplapply.
param <- SnowParam(workers = params$threads, type = "SOCK")
system.time(
my_sum_bp <- bplapply(my_list, summary, BPPARAM = param)
)
user system elapsed
0.441 0.099 4.626
identical(my_sum_mc, my_sum_bp)
[1] TRUE
Forking.
param <- SnowParam(workers = params$threads, type = "FORK")
system.time(
my_sum_bp_fork <- bplapply(my_list, summary, BPPARAM = param)
)
user system elapsed
0.122 0.145 1.687
identical(my_sum_bp, my_sum_bp_fork)
[1] TRUE
Using MulticoreParam.
param <- MulticoreParam(workers = params$threads, progressbar = FALSE)
system.time(
my_sum_bp_mc <- bplapply(my_list, summary, BPPARAM = param)
)
user system elapsed
0.949 0.140 1.063
identical(my_sum_bp_fork, my_sum_bp_mc)
[1] TRUE
furrr

furrr is a tidyverse-flavoured parallelisation package:
it provides future_map, future_map2,
future_pmap, future_walk, etc., which are
drop-in replacements for the corresponding purrr functions.
Under the hood it uses the future
framework, which means switching between sequential, multicore,
multisession, and cluster execution is just a matter of calling
plan() — your future_map(...) code does not
change. furrr also handles globals (variables captured from
the enclosing environment) and packages automatically, so you rarely
need an explicit clusterExport-style step.
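A small sketch of that automatic globals handling: threshold below is captured from the calling environment and shipped to the workers without any explicit export step.

```r
library(furrr)

plan(multisession, workers = 2)

threshold <- 0.5
# `threshold` is detected as a global and exported to the workers
# automatically; no clusterExport() equivalent is needed.
future_map_lgl(c(0.2, 0.7, 0.9), function(x) x > threshold)

plan(sequential)
```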
Load required libraries.
library(furrr)
Loading required package: future
library(purrr)
Attaching package: 'purrr'
The following objects are masked from 'package:foreach':
accumulate, when
Map without parallelisation.
system.time(
my_sum_pur <- map(my_list, summary)
)
user system elapsed
3.462 0.023 3.485
identical(my_sum_mc, my_sum_pur)
[1] TRUE
Map with parallelisation.
plan(multisession, workers = params$threads)
system.time(
my_sum_fur <- future_map(my_list, summary)
)
user system elapsed
0.147 0.133 2.149
identical(my_sum_pur, my_sum_fur)
[1] TRUE
future

The future package, by Henrik Bengtsson, is a unified
abstraction for parallel and distributed computing in R. The central
idea is the future: a placeholder for a value that may still be
computing somewhere — possibly in another process, possibly on another
machine. Code that uses futures is written once and the choice of
where to run it is made separately, by calling
plan():
- plan(sequential) — run in the current process (no parallelism).
- plan(multicore) — fork on Unix; not available on Windows or inside RStudio.
- plan(multisession) — socket cluster of background R sessions; works everywhere.
- plan(cluster) — explicit parallel::makeCluster cluster, including remote machines over SSH.
- plan(remote) / plan(batchtools_*) — remote workers, HPC schedulers, etc.

future itself is mostly a low-level engine; in practice
you usually use one of its frontends — future.apply
(*apply-style), furrr
(purrr-style), or doFuture (as a
foreach backend). The companion package future.apply
used below provides parallel versions of lapply,
sapply, vapply, mapply, and
apply.
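The placeholder idea can be seen with the low-level API itself: future() returns immediately while the expression evaluates elsewhere, and value() blocks until the result is ready.

```r
library(future)

plan(multisession, workers = 2)

# Creation returns immediately; the expression runs in a
# background R session.
f <- future({
  Sys.sleep(2)
  Sys.getpid()
})

resolved(f)  # typically FALSE straight away
value(f)     # blocks until the sleep finishes, then returns the PID
```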
Load required libraries.
library(future)
library(future.apply)
Map with parallelisation using future_lapply().
plan(multisession, workers = params$threads)
system.time(
my_sum_future_lapply <- future_lapply(my_list, summary)
)
user system elapsed
0.377 0.478 4.760
identical(my_sum, my_sum_future_lapply)
[1] TRUE
So, which package should you use? A few rules of thumb:

- Match the surrounding ecosystem: if you are writing Bioconductor code, use BiocParallel; if you’re already writing tidyverse / purrr code, use furrr. The cost of consistency with surrounding code is usually worth more than a small performance difference.
- If you just need a parallel lapply, reach for base parallel. It’s already installed, has no dependencies, and mclapply is a one-line change from lapply on Unix.
- If your loop body is complex, consider foreach + doParallel (or doFuture). foreach shines when you need custom result combination (.combine), nested loops (%:%), or to accumulate multiple outputs per iteration.
- For code that should run anywhere, use future. Writing your code once against future_lapply / future_map and switching backends with plan() is the most flexible option, and the future framework also handles globals, packages, and reproducible RNG (future.seed = TRUE) for you.
- Use pbapply when you want a progress bar on an otherwise straightforward parallel *apply.

One cross-cutting point worth remembering: parallelisation has overhead, so if your lapply already runs in milliseconds, parallelising it will usually make things slower.
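The reproducible-RNG point above can be sketched with future.seed, which gives each element its own parallel-safe RNG stream so that the draws do not depend on the backend or the number of workers:

```r
library(future.apply)

plan(multisession, workers = 2)
x1 <- future_lapply(1:4, function(i) rnorm(1), future.seed = 123)

plan(sequential)
x2 <- future_lapply(1:4, function(i) rnorm(1), future.seed = 123)

identical(x1, x2)  # same draws regardless of the plan
```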
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] future.apply_1.20.1 purrr_1.2.1 furrr_0.3.1
[4] future_1.69.0 BiocParallel_1.44.0 doParallel_1.0.17
[7] iterators_1.0.14 foreach_1.5.2 pbapply_1.7-4
[10] workflowr_1.7.2
loaded via a namespace (and not attached):
[1] sass_0.4.10 stringi_1.8.7 listenv_0.10.0 digest_0.6.39
[5] magrittr_2.0.4 evaluate_1.0.5 fastmap_1.2.0 rprojroot_2.1.1
[9] jsonlite_2.0.0 processx_3.8.6 whisker_0.4.1 ps_1.9.1
[13] promises_1.5.0 httr_1.4.8 codetools_0.2-20 jquerylib_0.1.4
[17] cli_3.6.5 rlang_1.1.7 parallelly_1.46.1 cachem_1.1.0
[21] yaml_2.3.12 otel_0.2.0 tools_4.5.2 httpuv_1.6.16
[25] globals_0.19.0 vctrs_0.7.1 R6_2.6.1 lifecycle_1.0.5
[29] git2r_0.36.2 stringr_1.6.0 fs_1.6.6 pkgconfig_2.0.3
[33] callr_3.7.6 pillar_1.11.1 bslib_0.10.0 later_1.4.6
[37] glue_1.8.0 Rcpp_1.1.1 xfun_0.56 tibble_3.3.1
[41] rstudioapi_0.18.0 knitr_1.51 htmltools_0.5.9 snow_0.4-4
[45] rmarkdown_2.30 compiler_4.5.2 getPass_0.2-4