Last updated: 2020-10-22

Checks: 6 passed, 1 warning

Knit directory: rr_tools/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201021) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 971cbb5. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  README.html
    Untracked:  analysis/exercise.rmd
    Untracked:  figure/

Unstaged changes:
    Modified:   analysis/index.Rmd
    Modified:   analysis/pipelines.rmd
    Modified:   analysis/programming.rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/pipelines.rmd) and HTML (docs/pipelines.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html b0d3bd7 jean997 2020-10-21 Build site.
Rmd 3e16a7b jean997 2020-10-21 add pipelines, programming, workbooks

Introduction

A pipeline is just a series of analysis steps that need to be completed to get to a final result. The first time you build a pipeline, you might just have a directory with some code and a README in it that says:

Step 1: Run script a.sh using data file data.tsv to generate file A
Step 2: Run script b.R to generate files B1 to B100
Step 3: Run script final.py with A and B1 to B100 to generate final_output.tsv

This is an ok way to do it! You have documentation that tells you what you did and all the code exists in one directory. Chances are, if you had to, you could recreate final_output.tsv, especially if you saved some info about what package versions you used.
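
Run by hand, that README translates into something like the sketch below. The command-line conventions (inputs and outputs passed as positional arguments) are an assumption for illustration, since the README doesn’t actually specify how each script is called:

./a.sh data.tsv A                              # Step 1: make file A from data.tsv
Rscript b.R A                                  # Step 2: write B1 through B100
python final.py A B{1..100} final_output.tsv   # Step 3: bash brace expansion gives B1 ... B100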

What can go wrong, and why do it differently?

  • Your colleague would like to use your pipeline but finds it complicated. Also they would like to use slightly different parameters in one of the scripts. You could send them an email that says “Here is the code, you just need to change lines 7 and 25 of a.sh and line 15 of b.R.” but there is a lot of room for error there.

  • You accidentally modify file A and don’t notice before you regenerate your final result.

  • You edit one of your scripts after generating the result and forget to rerun the pipeline.

There are pipeline tools that can help you avoid at least the first two of these problems and also handle much of the overhead of managing job submission to a cluster. I have only used Snakemake, which I will describe here, but there are other options.

I will also mention a specialized tool, DSC, which is designed for running simulations and makes it easy to track simulated data sets and add replicates or new analysis methods. DSC is also one of my favorite things. I highly recommend giving it a try.

Snakemake

Snakemake is a pipeline-building tool based on the ideas behind a Makefile. If you have never written a Makefile, don’t worry: you don’t need to be able to do that to do this. Learning to use Snakemake is definitely a bit of a time investment, but it is worth it. For me, here are the major upsides:

  • I don’t have to manage cluster submission. I only need to know how to run each step and the inputs and outputs of each step. Snakemake will figure out the order, which jobs can be run simultaneously, and will submit jobs to a compute cluster with different amounts of memory or time constraints.

  • I can restart a paused analysis from the middle without worrying about which files are ok. Snakemake will check time stamps and make sure that none of the upstream files have changed before running later jobs.

  • It makes an analysis easy to customize without modifying the code itself. You can provide options in a config file, like the output directory, the input data, or parameters. So you can use the same code to run a lot of different analyses by just making a new config file (see the sketch after this list).
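
For instance, a config-driven rule might look something like the following. The config keys (input_data, out_dir) and the contents of config.yaml are hypothetical, not taken from a real pipeline:

# config.yaml might contain:
#   input_data: data.tsv
#   out_dir: results

configfile: "config.yaml"

rule make_A:
    input: data = config["input_data"]
    output: out = config["out_dir"] + "/A"
    shell: "./a.sh {input.data} {output.out}"

Pointing the same Snakefile at a different config.yaml reruns the whole analysis with new inputs or parameters, without touching the code.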

Conceptually, a Snakemake file is just a list of “rules”. Each rule has input files and output files. If I wanted to turn my example above into a Snakemake pipeline, I would end up writing a file that looks something like

# "rule all" lists the final target of the whole pipeline
rule all:
    input: "final_output.tsv"

# Step 1: a.sh turns data.tsv into file A
rule make_A:
    input: data = "data.tsv"
    output: out = "A"
    shell: "./a.sh {input.data} {output.out}"

# Step 2: b.R generates B1 through B100
rule make_B:
    input: inA = "A"
    output: expand("B{n}", n = range(1, 101))
    shell: "Rscript b.R {input.inA} {output}"

# Step 3: final.py combines A and B1 through B100 into the final result
rule make_final:
    input: inA = "A", inB = expand("B{n}", n = range(1, 101))
    output: out = "final_output.tsv"
    shell: "python final.py {input.inA} {input.inB} {output.out}"

Here “rule all” is a special rule that tells Snakemake its ultimate goal. This is the only rule that doesn’t get an output file or instructions about what to do. The other rules each have an input, an output, and then a command to run to generate the output. This reads a lot like the README up above, with a little bit of special syntax, but now we have Snakemake taking care of actually running the analysis.
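
Invoking the pipeline is then a single command in the directory containing the Snakefile. A minimal sketch using two standard Snakemake flags:

snakemake -n          # dry run: print the jobs that would run, without running them
snakemake --cores 4   # run the pipeline, with up to 4 jobs in parallel

The dry run is worth making a habit: it shows exactly which steps Snakemake considers out of date before you commit to running anything.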

To start to get an idea of how it works, start with the overview in the Snakemake documentation (https://snakemake.readthedocs.io) and then look at the full tutorial.

Here is an example of a pipeline that I wrote that takes config and cluster files as input. The file Snakefile is the pipeline, while config.yaml and cluster.yaml specify the analysis parameters and how much memory/time/cores each step needs, respectively.
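
A pipeline laid out that way might be launched with something like the sketch below. The sbatch string is an assumption that depends on your scheduler, and {cluster.mem}/{cluster.time} assume those keys exist in cluster.yaml; --configfile, --cluster, --cluster-config, and --jobs are Snakemake flags from the versions current when this was written:

snakemake --configfile config.yaml \
    --cluster-config cluster.yaml \
    --cluster "sbatch --mem={cluster.mem} --time={cluster.time}" \
    --jobs 50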

DSC

DSC is a tool created by Gao Wang and Matthew Stephens. It stands for “dynamic statistical comparisons” and it is a tool for running simulations. Like Snakemake, DSC takes a bit of effort to learn. Once that effort is put in, though, the payoff is immediate: you never have to write another loop to run simulations for you. DSC also takes care of setting seeds for each simulation so every analysis is reproducible. Re-running the same DSC job should give exactly the same results.
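
To give a flavor: a DSC file defines modules (simulate, analyze, score) whose outputs, marked with $, feed into each other, plus a DSC section saying which combinations to run. The sketch below is loosely modeled on the introductory example in the DSC documentation; treat the exact syntax as approximate and check the docs before relying on it.

# simulate: draw n observations from a normal distribution
normal: R(x <- rnorm(n, mean = mu))
  mu: 0
  n: 100
  $data: x
  $true_mean: mu

# analyze: estimate the mean
mean: R(y <- mean(x))
  x: $data
  $est: y

# score: squared error of the estimate
sq_err: R(e <- (est - truth)^2)
  est: $est
  truth: $true_mean
  $error: e

DSC:
  run: normal * mean * sq_err

Running dsc on a file like this executes every replicate of the simulate, analyze, score chain and stores the results so you can query them later.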


sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5       rstudioapi_0.11  whisker_0.4      knitr_1.30      
 [5] magrittr_1.5     workflowr_1.6.2  R6_2.4.1         rlang_0.4.7     
 [9] stringr_1.4.0    tools_4.0.3      xfun_0.18        git2r_0.27.1    
[13] htmltools_0.5.0  ellipsis_0.3.1   yaml_2.2.1       digest_0.6.25   
[17] rprojroot_1.3-2  tibble_3.0.3     lifecycle_0.2.0  crayon_1.3.4    
[21] later_1.1.0.1    vctrs_0.3.4      promises_1.1.1   fs_1.5.0        
[25] glue_1.4.2       evaluate_0.14    rmarkdown_2.3    stringi_1.5.3   
[29] compiler_4.0.3   pillar_1.4.6     backports_1.1.10 httpuv_1.5.4    
[33] pkgconfig_2.0.3