Last updated: 2023-09-21

Checks: 1 1

Knit directory: snakemake_tutorial/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version f5e6747. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/basics.Rmd
    Untracked:  analysis/multistep.Rmd
    Untracked:  analysis/wildcards.Rmd

Unstaged changes:
    Modified:   analysis/index.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.


Introduction

In this section we will expand our workflow to include multiple steps.

In this section:

  1. Add a rule with multiple inputs
  2. Add a rule with multiple outputs, use multiext()
  3. Execute a multi-step workflow
  4. Use Snakemake to draw a DAG to visualize a workflow
  5. Learn how Snakemake determines what to re-run

Multiple Inputs

So far our get_stats rule has one input file and one output file. We could have either multiple inputs or multiple outputs. These can be specified explicitly, or using utilities like expand(). Lets add a rule that uses bcftools to concatonate all three chromosomes together.

rule combine_data:
    input: chr20 = "data/chr20.vcf.gz", chr21 = "data/chr21.vcf.gz", chr22 = "data/chr22.vcf.gz"
    output: out = "data/all.vcf.gz" 
    shell: "bcftools concat -o {output.out} {input.chr20} {input.chr21} {input.chr22}"

Notice that we have named elements in both input and output and we can access these elements using a placeholder by placing the name after the “.”. We can see that this specification might get annoying so fortunately, we can still use the placeholder {input} to access the entire list of inputs. We can also make use of expand() instead of explicitly writing out all of the file names. The rule above is equivalent to

rule combine_data:
    input: expand("data/chr{c}.vcf.gz", c = range(20, 23))
    output: "data/all.vcf.gz"
    shell: "bcftools concat -o {output} {input}"

If you leave rule all as we had it before, what do you think will happen if we run Snakemake?

Multiple Outputs and multiext()

We can also have multiple outputs. Let’s write a rule that takes in the concatonated data and uses Plink2 to perform some basic variant quality control steps. This is a good chance to learn about another utility that functions like expand(). Plink is going to produce a bunch of files that all have the same root and different extensions. To indicate this, we can use multiext().

rule variant_qc:
    input: "data/all.vcf.gz"
    output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
    shell: "plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only --var-min-qual 95 --geno 0.1 --maf 0.01 --make-pgen --out data/all_varqc "

Finishing the workflow

Let’s add two more rules to make a workflow that computes the genetic principal components of the 100 individuals in our data. To do this, we will use Plink to write out an LD pruned list of variants. Then we will use plink to compute the PCs.

rule ld_prune:
    input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
    output: "data/{prefix}.prune.in"
    shell: "plink2 --pfile data/{wildcards.prefix} --indep-pairwise 1000kb 0.01 --out data/{wildcards.prefix}"

rule pca:
    input: data = multiext("data/{prefix}", ".pgen", ".psam", ".pvar"), prune = "data/{prefix}.prune.in"
    output: "results/{prefix}.eigenvec"
    shell: "plink2 --pfile data/{wildcards.prefix} --extract {input.prune} --pca --out results/{wildcards.prefix}"

Here, we have used the wildcard {prefix} to make our rule more generic. Look at the inputs and outputs of these rules and the previous rules. What do you expect {prefix} to be?

Finally, lets add a step that uses R to plot the results. To do this, we will use my favorite way to use R with Snakemake. We want an R script that we can call at the command line that takes in the PCs and outputs a plot. It will help us keep our workflow more generalizeable if we don’t have to hard code these file names in the R script, so lets pass them in as arguments. Add the rule below to the Snakefile:

rule plot_pca:
    input: "results/{prefix}.eigenvec"
    output: "results/{prefix}_pca.png"
    shell: "Rscript code/plot_pcs.R {input} {output}"

This rule makes use of a piece of code that is alread in the code/ folder which looks like this:

library(ggplot2)
library(readr)
    

args <- commandArgs(trailingOnly = TRUE)
input <- args[1]
output <- args[2]
    
pcs <- read_table(input)
my_plot <- ggplot(pcs) + geom_point(aes(x = PC1, y = PC2))
ggsave(my_plot,  file = output, height = 4, width = 5, units = "in", dpi = 300)

The line args <- commandArgs(trailingOnly=TRUE) collects the command line arguments into a variable that we can access in R.

Set the input in rule all to the value that you think will execute the PCA workflow. Test your code by running a dry run and then execute it! Remember to add the -j 1 -c 1 flags when you execute for real.

Visualizing the DAG of jobs

To get to know our workflow better, we can use Snakemake to produce a visual representation of it. To create a png file, use

snakemake --dag  | dot -Tpng > dag.png

Open the file (or move the file to your local computer and then open it). You should see something like this

DAG of workflow
DAG of workflow

Boxes with dashed lines indicate steps that won’t be run if Snakemake were run now and solid lines indicate steps that need to be re-run.

When Rules are Re-Run

By default, Snakemake is lazy (efficient) and does not want to re-do work that has already been done. After determining the target file and the series of steps, Snakemake will check the time stamps of the intermediate products and compare these to each other and the code. If the code hasn’t been modified since files were produced, and not pre-cursors have been changed since files lower down the workflow were created, then there is nothing to do.

One way to trigger a re-run is to modify an upstream file (or to simply update it’s time stamp). To see this use

touch data/all_varqc.pgen

This will update the time stamp of data/all_varqc.pgen to now. If you re-ran Snakemake, which steps do you think it would try to repeat? Check your answer by running a dry-run. However, removing an intermediate file will not trigger a re-run.

Another way to trigger a re-run is to modify the content of the rule in the Snakefile. Note that this does not extend to the content of the scripts called by the rules. So if we modify code/plot_pcs.R, Snakemake will not know that it needs to re-run the plotting step.

There are several command line options that control how Snakemake decides what to re-run. The option -F (for force) forces Snakemake to re-run everything. Usually, this is overkill and you don’t want to do this. The -R option can be used to tell Snakemake to re-run everything from a given rule forward. For example,

snakemake -j 1 -c 1 -R pca

will re-run the pca rule and the plot_pca rule. Finally the --rerun-triggers command line option can be used to modify the behavior of what should be re-run when. For more details on this option, check the Snakemake documentation (this is a bit beyond this tutorial).

Final Snakefile

At the end of this section, your Snakefile should look like this:

rule all:
    input: "results/all_varqc_pca.png" 

rule get_stats:
    input: "data/chr{c}.vcf.gz"
    output: "results/chr{c}_stats.txt"
    shell: "mkdir -p results; bcftools stats {input} > {output}"

rule combine_data:
    input: expand("data/chr{c}.vcf.gz", c = range(20, 23))
    output: "data/all.vcf.gz"
    shell: "bcftools concat -o {output} {input}"

rule variant_qc:
    input: "data/all.vcf.gz"
    output: multiext("data/all_varqc", ".pgen", ".psam", ".pvar", ".log")
    shell: "plink2 --vcf {input} --set-missing-var-ids '@:#' --snps-only --var-min-qual 95 --geno 0.1 --maf 0.01 --make-pgen --out data/all_varqc "


rule ld_prune:
    input: multiext("data/{prefix}", ".pgen", ".psam", ".pvar")
    output: "data/{prefix}.prune.in"
    shell: "plink2 --pfile data/{wildcards.prefix} --indep-pairwise 1000kb 0.01 --out data/{wildcards.prefix}"

rule pca:
    input: data = multiext("data/{prefix}", ".pgen", ".psam", ".pvar"), prune = "data/{prefix}.prune.in"
    output: "results/{prefix}.eigenvec"
    shell: "plink2 --pfile data/{wildcards.prefix} --extract {input.prune} --pca --out results/{wildcards.prefix}"

rule plot_pca:
    input: "results/{prefix}.eigenvec"
    output: "results/{prefix}_pca.png"
    shell: "Rscript code/plot_pcs.R {input} {output}"

This code is in code/3.Snakefile