Last updated: 2023-09-21
Checks: 1 1
Knit directory: snakemake_tutorial/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R
Markdown file created these results, you’ll want to first commit it to
the Git repo. If you’re still working on the analysis, you can ignore
this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 8f3e1ea. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rproj.user/
Untracked files:
Untracked: analysis/basics.Rmd
Unstaged changes:
Modified: analysis/_site.yml
Modified: analysis/index.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with
wflow_publish()
to start tracking its development.
In this section, we will learn how to use Snakemake to execute commands.
Goals for this session:
Inside the data/
folder you will see three files.
ls data/
chr20.vcf.gz chr21.vcf.gz chr22.vcf.gz
These are vcf files which contain genotype information for 100 people from 1000 Genomes phase 3. As you might guess from the file names, each file corresponds to one chromosome for chromosomes 20, 21, and 22. We can use bcftools to summarize the contents of these files. The command below will generate some summaries and write them to a file. Run this command in your tutorial directory.
bcftools stats data/chr22.vcf.gz > chr22_stats.txt
If we take a look at this file, we find out that
data/chr22.vcf.gz
contains 100 samples and about a million
variants. We also have lots of other information but this isn’t that
important for our purposes.
We now want to write a rule so that Snakemake could execute this
command for us. The input for the rule is the data file and the output
is the file chr22_stats.txt
. Create a new file in your
tutorial directory called Snakefile
and open it. I like to
use Vim so I would type vim Snakefile
. If you are using
nano
, you could type nano Snakefile
. If you
are on the GreatLakes web interface, click on “+ New File” and then
click on Edit in the dropdown menu. Put the text below into your file
and then save and exit.
rule get_stats22:
input: "data/chr22.vcf.gz"
output: "results/chr22_stats.txt"
shell: "mkdir -p results; bcftools stats data/chr22.vcf.gz > results/chr22_stats.txt"
This rule lists the input, output, and action (shell
)
for the process of using bcftools to summarize the vcf file. Our rule
differs slightly from the command we used in the previous section
because it puts the output into a directory called
results/
.
Workflow Tip: Organize the files in your workflow
into subdirectories. I like to keep the Snakefile and sometimes a few
helper files at the top level. Generally, I like to have sub-directories
called code
where I put scripts (could be called
R
or scripts
etc), data/
, and
results/
. You can choose a different organizational scheme
that works for you but it is nice to have a system.
Activate the snakemake conda environment by typing
conda activate snakemake
Now we are ready to run Snakemake for the first time. Type
snakemake -j 1 -c 1 -p
The -j 1
and -c 1
flags tell Snakemake to
run one job at a time and to use one core. The -p
flag
tells Snakemake to print out the shell commands it will use before it
runs them. We don’t need to tell it anything else because we named our
Snakefile Snakefile
which is the default name that
Snakemake will look for. We could have named it anything else but then
we would need to use the -s
flag to let Snakemake know what
it is called. For example, if you named it
my_first_snakefile
, you could use the command
snakemake -j 1 -c 1 -p -s my_first_snakefile
The output you get from the previous command should look like
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
----------- ------- ------------- -------------
get_stats22 1 1 1
total 1 1 1
Select jobs to execute...
[Thu Sep 21 16:52:56 2023]
rule get_stats22:
input: data/chr22.vcf.gz
output: results/chr22_stats.txt
jobid: 0
reason: Missing output files: results/chr22_stats.txt
resources: tmpdir=/tmp
mkdir -p results; bcftools stats data/chr22.vcf.gz > results/chr22_stats.txt
[Thu Sep 21 16:53:00 2023]
Finished job 0.
1 of 1 steps (100%) done
Complete log: .snakemake/log/2023-09-21T165255.457024.snakemake.log
In this output, Snakemake is telling us what is happening. First, it
identifies a target file and then uses our rules to build a DAG of jobs
to create that file. In our case, it identifies
results/chr22_stats.txt
as the target file (more on why it
chose this coming) and then identifies that it needs to execute one rule
to create this file. It then executes the necessary rule and writes some
inforamtion to a log file.
-n
)A very useful command line option for Snakemake is -n
.
This option will cause Snakemake to only do the first steps of
identifying the target, building the DAG, and making the plan of jobs to
execute without actually executing any of the jobs. I almost always run
Snakemake with -n
before running it “for real” because this
can help me see that Snakemake is going to do what I thought it would
do. Let’s try this now.
snakemake -n -p
We can omit the -j 1 -c 1
options because these only
relate to job execution. You should see
Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).
Snakemake says there is nothing to do because the target file already
exists and the code hasn’t changed since the last time the target file
was created. We will talk more about when Snakemake will re-run a rule
in future sections. For now, delete results/chr22_stats.txt
and re-run the dry-run command above. You should now see different
output indicating that Snakemake will run the rule again.
In our file, we didn’t explicitly tell Snakemake what file to create, it just guessed. You might imagine that in a more complicated workflow, we will probably need to be explicit about what we want. Snakemake has a hierarchy of places it will look to decide what the target is. To see this in action, lets add a second rule to our Snakefile.
rule get_stats21:
input: "data/chr21.vcf.gz"
output: "results/chr21_stats.txt"
shell: "mkdir -p results; bcftools stats data/chr21.vcf.gz > results/chr21_stats.txt"
Hierarchy for Determining Target File(s):
snakemake
command.default_target
.From this hierarchy, if we don’t make any further changes to the Snakefile and run the command
snakemake -n -p
which file do you think will be the target file? Check your answer by running a dry-run.
The command
snakemake -n -p results/chr21_stats.txt
specifies the target files results/chr21_stats.txt
. We
could also have multiple target files such as
snakemake -n -p results/chr22_stats.txt results/chr21_stats.txt
Instead of using the file name, we could also use the name of the rule. Using
snakemake -n -p get_stats21
indicates that our target file is the output of the rule
get_stats21
. This specification will take precedence over
any other way of indicating the target.
Try running
snakemake -n -p results/chr19_stats.txt
What happens?
default_target
Any rule can be a target rule as long as that rule does not contain
any wildcards (we will learn about these next). To indicate that a rule
is a target add the line default_target: True
to the rule
before the shell:
line. For example, modify the
get_stats21
rule to read
rule get_stats21:
input: "data/chr21.vcf.gz"
output: "results/chr21_stats.txt"
default_target: True
shell: "mkdir -p results; bcftools stats data/chr21.vcf.gz > results/chr21_stats.txt"
If you run
snakemake -n -p
You should see that results/chr21_stats.txt
is the
target.
I believe that if multiple rules are specified with
default_target: True
, Snakemake will pick the last one. You
should not do this as it will make your workflow confusing.
The most common way to use Snakemake is to allow it to execute the
first rule as the default. Often, this rule is treated specially and
given the name all
. The all
rule has an input
but no output or shell line. Add these lines to the top
of your Snakefile and remove any designations of
default_target: True
that you added in the previous
section.
rule all:
input: results/chr22_stats.txt, results/chr21_stats.txt
This will make the files listed in input
for rule
all
the target files.
At the end of this section, you Snakefile should look like the file
in code/1.Snakefile
rule all:
input: "results/chr21_stats.txt", "results/chr22_stats.txt"
rule get_stats22:
input: "data/chr22.vcf.gz"
output: "results/chr22_stats.txt"
shell: "mkdir -p results; bcftools stats data/chr22.vcf.gz > results/chr22_stats.txt"
rule get_stats21:
input: "data/chr21.vcf.gz"
output: "results/chr21_stats.txt"
shell: "mkdir -p results; bcftools stats data/chr21.vcf.gz > results/chr21_stats.txt"