Last updated: 2020-10-21

Checks: 7 passed, 0 failed

Knit directory: rr_tools/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201021) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 3e16a7b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  README.html
    Untracked:  analysis/exercise.rmd
    Untracked:  figure/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/pipelines.rmd) and HTML (docs/pipelines.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author   Date        Message
Rmd   3e16a7b  jean997  2020-10-21  add pipelines, programming, workbooks

Introduction

A pipeline is just a series of analysis steps that need to be completed to get to a final result. The first time you build a pipeline, you might just have a directory with some code and a Readme in it that says:

Step 1: Run script a.sh to generate file A
Step 2: Run script b.R to generate files B1 to B100
Step 3: Run script final.py with A and B1 to B100 to generate final_output.tsv

This is an ok way to do it! You have documentation that tells you what you did and all the code exists in one directory. Chances are, if you had to, you could recreate final_output.tsv, especially if you saved some info about what package versions you used.

What can go wrong, and why do it differently?

  • Your colleague would like to use your pipeline but finds it complicated. Also they would like to use slightly different parameters in one of the scripts. You could send them an email that says “Here is the code, you just need to change lines 7 and 25 of a.sh and line 15 of b.R.” but there is a lot of room for error there.

  • You accidentally modify file A and don’t notice before you regenerate your final result.

  • You edit one of your scripts after generating the result and forget to rerun the pipeline.

There are pipeline tools that can help you avoid at least the first two of these problems and also handle a lot of the overhead of managing job submission to a cluster. I have only used Snakemake, which I will describe here, but there are other options.

I will also mention a specialized tool, DSC, which is designed especially for running simulations and makes it easy to track simulated data sets and add replicates or new analysis methods. DSC is also one of my favorite things. I highly recommend giving it a try.

Snakemake

Snakemake is a pipeline-building tool based on the syntax and ideas of a Makefile. If you have never written a Makefile, don’t worry; you don’t need to be able to do that to use Snakemake. Learning to use Snakemake is definitely a bit of a time investment, but it is worth it. For me, these are the major upsides:

  • I don’t have to manage cluster submission. I only need to know how to run each step and what the inputs and outputs of each step are. Snakemake will figure out the order and what can be run simultaneously, and it will submit jobs to a compute cluster with different memory or time requirements for each step.

  • I can restart a paused analysis from the middle without worrying about which files are up to date. Snakemake will check timestamps and make sure that none of the upstream files have changed before running later jobs.

  • It makes an analysis easy to customize. You can provide options such as the output directory, input data, or parameters in a config file, so you can use the same code to run many different analyses just by writing a new config file.

To give a sense of what this looks like, here is an example of a pipeline that I wrote for performing Mendelian randomization analysis on user-specified pairs of traits. The file Snakefile is the pipeline itself, while config.yaml and cluster.yaml specify the analysis parameters and how much memory/time/cores each step needs, respectively.
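Separately from that pipeline, here is a minimal sketch of how the three-step example from the introduction might look as a Snakefile. The script names follow that example, but the rule names, output file names, and the config key out_dir are made up for illustration.

    configfile: "config.yaml"   # e.g. a file containing just: out_dir: "results/"

    # The first rule names the final target of the pipeline.
    rule all:
        input:
            config["out_dir"] + "final_output.tsv"

    # Step 1: run a.sh to generate file A
    rule make_A:
        output:
            config["out_dir"] + "A.txt"
        shell:
            "bash a.sh {output}"

    # Step 2: run b.R once per replicate to generate B1 to B100
    rule make_B:
        output:
            config["out_dir"] + "B{rep}.txt"
        shell:
            "Rscript b.R {wildcards.rep} {output}"

    # Step 3: combine A and B1 to B100 into the final result
    rule combine:
        input:
            A = config["out_dir"] + "A.txt",
            B = expand(config["out_dir"] + "B{rep}.txt", rep=range(1, 101))
        output:
            config["out_dir"] + "final_output.tsv"
        shell:
            "python final.py {input.A} {input.B} {output}"

Snakemake works backwards from the target listed in rule all, figures out which rules produce the missing inputs, and can run make_A and the 100 make_B jobs in parallel before combine.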

To start to get an idea of how Snakemake works, start here and then look at the full tutorial.
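Once a Snakefile is in place, typical invocations look something like the sketch below. This assumes a SLURM cluster and a cluster.yaml with mem, time, and cores entries for each rule; adjust to your scheduler.

    # dry run: list the jobs Snakemake would run without actually running them
    snakemake -n

    # run everything locally, using up to 4 cores
    snakemake --cores 4

    # submit each step as its own cluster job, pulling per-rule resources from cluster.yaml
    snakemake --jobs 50 --cluster-config cluster.yaml \
        --cluster "sbatch --mem={cluster.mem} --time={cluster.time} -c {cluster.cores}"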

DSC

DSC is a tool created by Gao Wang and Matthew Stephens. It stands for “dynamic statistical comparisons” and it is a tool for running simulations. Like Snakemake, DSC takes a bit of effort to learn, but once that effort is put in, the payoff is immediate. You never have to write another loop or wrapper function to run your simulations. DSC also takes care of setting seeds for each simulation, so every analysis is reproducible: re-running the same DSC job should give exactly the same results.
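To give a flavor of the syntax, here is a rough sketch of a DSC file in the style of the examples in the DSC documentation. The module names, the inline R snippets, and the replicate count are all made up for illustration; see the DSC docs for the details.

    # simulate: draw n observations from a normal distribution
    normal: R(x <- rnorm(n))
      n: 100, 1000
      $data: x
      $true_mean: 0

    # two competing estimators of the mean
    mean: R(est <- mean(x))
      x: $data
      $est: est

    median: R(est <- median(x))
      x: $data
      $est: est

    # score each estimate by its squared error
    sq_err: R(err <- (est - truth)^2)
      est: $est
      truth: $true_mean
      $error: err

    DSC:
      define:
        estimate: mean, median
      run: normal * estimate * sq_err
      replicate: 50

DSC expands this into every combination of simulation settings, estimators, and replicates, keeps track of the seed behind each simulated data set, and stores the results so you can query them later from R.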


sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       rstudioapi_0.11  whisker_0.4      knitr_1.30      
 [5] magrittr_1.5     R6_2.4.1         rlang_0.4.7      stringr_1.4.0   
 [9] tools_4.0.3      xfun_0.18        git2r_0.27.1     htmltools_0.5.0 
[13] ellipsis_0.3.1   rprojroot_1.3-2  yaml_2.2.1       digest_0.6.25   
[17] tibble_3.0.3     lifecycle_0.2.0  crayon_1.3.4     later_1.1.0.1   
[21] vctrs_0.3.4      promises_1.1.1   fs_1.5.0         glue_1.4.2      
[25] evaluate_0.14    rmarkdown_2.3    stringi_1.5.3    compiler_4.0.3  
[29] pillar_1.4.6     backports_1.1.10 httpuv_1.5.4     pkgconfig_2.0.3