Last updated: 2020-08-23

Checks: 7 passed, 0 failed

Knit directory: bioinformatics_tips/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20200503) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version f52afb5. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Modified:   analysis/simulation.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/pipelining.Rmd) and HTML (docs/pipelining.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author    Date        Message
Rmd   f52afb5  davetang  2020-08-23  Update
html  3e6869c  davetang  2020-08-15  Build site.
Rmd   93ad1d0  davetang  2020-08-15  Update
html  e9934e7  davetang  2020-06-07  Security basics
html  b45e21b  davetang  2020-05-23  Build site.
Rmd   3a4d719  davetang  2020-05-23  Workflow management system

A workflow management system (WMS) is software that makes it easier to implement, execute, and manage workflows. The analysis of high-throughput sequencing (HTS) data benefits greatly from WMSs because the processing of raw HTS data requires many sequential steps. Typically in a bioinformatics workflow or pipeline, HTS data is analysed by quality control (QC) tools, preprocessed accordingly, aligned to a reference sequence, and summarised in a manner that is relevant to the biological question.

However, you may be wondering whether a WMS is really necessary when you could implement your bioinformatics pipeline with a collection of shell scripts (or scripts in a similar language) in a relatively short amount of time. Is it worth investing your time in learning yet another tool, especially if your interest is in the biology that the data holds? I think so, because knowing at least one WMS is a core bioinformatics skill that everyone should possess.

If you’re convinced that you should at least try a WMS, you may be wondering how hard it is to learn one and how long it will take to implement your bioinformatics workflow with it. I am most familiar with the Workflow Description Language (WDL), so my comments are based on WDL. I don’t think WDL is hard to learn or to use; I wrote a short blog post on learning WDL, so you can be the judge.

With WDL, you set up each component of your pipeline as an individual task, and each task follows a defined structure; you can also specify resource usage per task, i.e. how much CPU and memory should be used for that task. You then create a workflow by calling your tasks, and you can specify that the output of one task is used as the input of the next. A JSON file acts as the config file for your workflow and specifies parameters and the locations of files, such as your input files. Finally, a separate tool called Cromwell is used to execute your pipeline, and it is Cromwell that handles all the logging, resource management, and pipeline execution. One of the benefits of WDL/Cromwell is that you can execute the same pipeline on different computational infrastructures, making your pipeline very portable!
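To make this concrete, below is a minimal sketch of what a WDL task and workflow might look like (the task and workflow names, file names, and resource values are illustrative, not taken from an existing pipeline):

version 1.0

# a single task that sorts a file numerically
task sort_file {
    input {
        File infile
    }
    command <<<
        sort -n ~{infile} > sorted.txt
    >>>
    output {
        File sorted = "sorted.txt"
    }
    runtime {
        # resource requests for this task
        cpu: 1
        memory: "1 GB"
    }
}

# the workflow wires tasks together; here it just calls sort_file
workflow sort_workflow {
    input {
        File infile
    }
    call sort_file { input: infile = infile }
    output {
        File sorted = sort_file.sorted
    }
}

The accompanying JSON config file simply maps workflow inputs to files on disk:

{
    "sort_workflow.infile": "test.txt"
}

With Cromwell downloaded, something like java -jar cromwell.jar run sort.wdl --inputs inputs.json (the jar and file names will differ on your system) executes the workflow, with Cromwell taking care of the logging and resource management.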

Another WMS is Snakemake, which uses rules to set up each part of your bioinformatics pipeline, akin to tasks in WDL.

rule sort:
    input:
        "test.txt"
    output:
        "test.sorted.txt"
    shell:
        "sort -n {input} > {output}"

If we run the example above with Snakemake, the input file test.txt will be sorted numerically and the output stored in test.sorted.txt. Typically, you would write a pipeline (a Snakefile) that takes its input from a config file (e.g. config.yaml); if you wanted to run the pipeline on a new dataset, you would just need to create a new config file.
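For example, a config.yaml could hold the input file name (the key input_file below is just an illustrative choice):

input_file: test.txt

and the Snakefile would read it with the configfile directive:

# load parameters, such as the input file name, from config.yaml
configfile: "config.yaml"

rule sort:
    input:
        config["input_file"]
    output:
        "test.sorted.txt"
    shell:
        "sort -n {input} > {output}"

Running snakemake --cores 1 in the pipeline directory would then sort whatever file is listed in config.yaml.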

Other workflow management systems include Bpipe and Nextflow, which are both based on Groovy. A survey conducted on Twitter lists other systems and showed that Snakemake is the most popular. There is also a nice discussion on Reddit on the strengths and weaknesses of different workflow management systems.

Personally, I use WDL because that’s what the Broad Institute uses and I wanted to use some of their pipelines. This is another advantage of learning a WMS: it is very likely that somebody has already implemented a similar pipeline. Check out these pipelines implemented using different WMSs:

Even if your exact pipeline hasn’t been previously implemented, you probably can still find a similar pipeline that you could modify for your needs.

Lastly, WMSs make it much easier to reproduce your work. TBC.
