Last updated: 2019-12-05
This reproducible R Markdown analysis was created with workflowr (version 1.5.0).
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 5dc4fe0 | davetang | 2019-12-05 | Build site. |
Rmd | 179c2bb | davetang | 2019-12-05 | wflow_publish(files = c(“analysis/conda.Rmd”, “analysis/docker.Rmd”, “analysis/index.Rmd”, |
html | 9aa9aa4 | davetang | 2019-12-05 | Build site. |
Rmd | ec7204f | davetang | 2019-12-05 | wflow_publish(files = c(“analysis/about.Rmd”, “analysis/conda.Rmd”, “analysis/docker.Rmd”, |
html | 7b114c5 | First Last | 2019-12-04 | Build site. |
Rmd | a4180a4 | First Last | 2019-12-04 | wflow_publish(files = c(“analysis/conda.Rmd”, “analysis/index.Rmd”)) |
This tutorial was adapted from here. The output shown is based on running the commands inside a container using the continuumio/miniconda Docker image. To follow this tutorial, make sure you have Docker installed and use docker pull to download the latest image.
docker pull continuumio/miniconda
The objective of the workshop is to demonstrate how Conda can be used to simplify the installation of bioinformatic tools and to create reproducible (and separate) environments.
From the Conda documentation:
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
If you have run into dependency problems before when trying to install bioinformatic tools, Conda helps deal with this. Furthermore, Conda makes it easier to install and work with incompatible tools.
Anaconda is a Python and R distribution that bundles Conda. It is a data science platform that comes with a lot of packages (too many in my opinion).
Miniconda is a minimal installer for Conda. It is a small, bootstrap version of Anaconda that includes only Conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. I prefer using Miniconda and only installing tools that I need.
Bioconda is a distribution of bioinformatics software realised as a channel for the versatile Conda package manager. Conda channels are simply the locations where packages are stored. Most widely used bioinformatic tools are available on the Bioconda channel, which hosts over 6,000 bioinformatics packages.
For this workshop we are using Conda inside a Docker container. Once you have pulled the latest image, run the following command.
# run new container
docker run -it --rm continuumio/miniconda bash
After running the command above, your command prompt should look something like this:
(base) root@d470a3e9da91:/#
Like any good tool, if you type conda without any parameters you will get the usage and all the subcommands.
conda
usage: conda [-h] [-V] command ...
conda is a tool for managing and deploying applications, environments and packages.
Options:
positional arguments:
command
clean Remove unused packages and caches.
config Modify configuration values in .condarc. This is modeled
after the git config command. Writes to the user .condarc
file (/root/.condarc) by default.
create Create a new conda environment from a list of specified
packages.
help Displays a list of available conda commands and their help
strings.
info Display information about current conda install.
init Initialize conda for shell interaction. [Experimental]
install Installs a list of packages into a specified conda
environment.
list List linked packages in a conda environment.
package Low-level conda package utility. (EXPERIMENTAL)
remove Remove a list of packages from a specified conda environment.
uninstall Alias for conda remove.
run Run an executable in a conda environment. [Experimental]
search Search for packages and display associated information. The
input is a MatchSpec, a query language for conda packages.
See examples below.
update Updates conda packages to the latest compatible version.
upgrade Alias for conda update.
optional arguments:
-h, --help Show this help message and exit.
-V, --version Show the conda version number and exit.
conda commands available from other packages:
env
We can find out what version of Conda we are using.
conda --version
conda 4.7.10
To see the full documentation for any command, type the command followed by --help. For example, to learn about the conda update command:
conda update --help
We will now make sure that Conda is up to date by using conda update. Conda will compare versions and let you know what is available to install. It will also tell you about other packages that will be automatically updated or changed with the update.
# when prompted enter "y"
conda update conda
conda --version
conda 4.7.12
You can also update all Conda packages to the latest compatible version.
# don't need to run this
conda update --all
Conda downloads and caches package files; remember to run conda clean periodically to remove them. The -a flag removes the index cache, lock files, unused cached packages, and tarballs.
conda clean -a
Installing bcftools is just a single command using the Bioconda channel.
conda install -c bioconda bcftools
cd /tmp
wget https://github.com/davetang/learning_vcf_file/raw/master/aln_consensus.bcf
# get all SNPs, ignore the metadata, and view the first two lines
bcftools view -v snps aln_consensus.bcf | grep -v "^#" | head -2
1000000 336 . A G 221.999 . DP=112;VDB=0.756462;SGB=-0.693147;MQ0F=0;AF1=1;AC1=2;DP4=0,0,102,0;MQ=60;FQ=-281.989 GT:PL 1/1:255,255,0
1000000 378 . T C 221.999 . DP=101;VDB=0.704379;SGB=-0.693147;MQ0F=0;AF1=1;AC1=2;DP4=0,0,99,0;MQ=60;FQ=-281.989 GT:PL 1/1:255,255,0
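Each record above packs several values into the INFO column. As a minimal sketch that is not part of the original workshop, the snippet below pulls the position, reference allele, alternate allele, and read depth (DP) out of a record with awk. The function name vcf_summary is a hypothetical helper, and the test record is the first line of the output above rebuilt with tab separators, which is what bcftools actually emits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print POS, REF, ALT and the DP value for each VCF record on stdin.
vcf_summary() {
  awk -F'\t' '
    !/^#/ {
      depth = "NA"
      n = split($8, info, ";")          # INFO is column 8; entries are separated by ";"
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^DP=/) {         # matches DP= but not DP4=
          depth = substr(info[i], 4)    # drop the leading "DP="
        }
      }
      print $2, $4, $5, depth
    }'
}

# First record from the output above, rebuilt with tab separators.
record=$(printf '1000000\t336\t.\tA\tG\t221.999\t.\tDP=112;VDB=0.756462;SGB=-0.693147;MQ0F=0;AF1=1;AC1=2;DP4=0,0,102,0;MQ=60;FQ=-281.989\tGT:PL\t1/1:255,255,0')

printf '%s\n' "$record" | vcf_summary
```

Inside the container the same function can be placed at the end of the pipeline, for example bcftools view -v snps aln_consensus.bcf | vcf_summary. I believe bcftools query with a format string such as '%POS\t%REF\t%ALT\t%INFO/DP\n' gives a similar table without awk, but check bcftools query --help before relying on it.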
Here’s how you would install bcftools without Conda.
Using Conda, you can create an isolated environment for your project. An environment is a set of packages that can be used in one or multiple projects. The default environment with Miniconda is the base environment. I don’t recommend installing all your packages/tools under the same environment.
There are two ways of creating a Conda environment: from an environment file (environment.yml), or by specifying the packages on the command line.
Here is an example of an environment file (environment.yml) I used for a specific project.
name: new_project
channels:
- bioconda
- anaconda
- conda-forge
- defaults
dependencies:
- fastqc
- multiqc
- cutadapt
- bwa
- samtools
- macs2
- bedtools
- deeptools
- minimap2
- star
- parallel
- idr
Now, let’s use the following environment.yml file to install an older version of bwa in an isolated environment called bwa_old.
name: bwa_old
channels:
- bioconda
dependencies:
- bwa=0.7.15
Create the environment.
wget https://raw.githubusercontent.com/davetang/reproducible_bioinformatics/master/environment.yml
conda env create --file environment.yml
# check list of environments
conda env list
# conda environments:
#
base * /opt/conda
bwa_old /opt/conda/envs/bwa_old
Activate the environment. The (bwa_old) at the beginning of the prompt indicates that we are currently using the bwa_old Conda environment.
conda activate bwa_old
# your prompt should change to
# (bwa_old) root@d470a3e9da91:/tmp#
bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.15-r1140
Contact: Heng Li <lh3@sanger.ac.uk>
Deactivate the environment.
conda deactivate
# your prompt will change back to
# (base) root@d470a3e9da91:/tmp#
We can also create Conda environments by specifying the name, channel, and list of packages in the terminal. In the example below, we create the test_env environment, which uses Python 2.7 and the libraries numpy, matplotlib, and pandas.
conda create -c conda-forge -n test_env python=2.7 numpy matplotlib pandas
Conda will solve any dependencies between the packages like before and create a new environment with those packages. I prefer creating environments using an environment file rather than on the command line.
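Since I prefer environment files, here is a minimal sketch of how the command-line specification above could be written out as an environment file instead. The function make_env_file is a hypothetical helper and not a Conda command; it only prints YAML in the same layout as the bwa_old file shown earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an environment file for a name, one channel and a list of packages.
make_env_file() {
  local name="$1" channel="$2"
  shift 2
  printf 'name: %s\n' "$name"
  printf 'channels:\n  - %s\n' "$channel"
  printf 'dependencies:\n'
  printf '  - %s\n' "$@"   # printf repeats the format once per remaining argument
}

# Same name, channel and packages as the conda create command above.
make_env_file test_env conda-forge python=2.7 numpy matplotlib pandas
```

Redirecting the output to a file with a .yml extension, for example test_env.yml, and running conda env create --file test_env.yml should then build the same environment, and the file can be committed alongside the project.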
Before deleting an environment make sure you are not currently using the environment or you will get an error.
conda env remove -n bwa_old
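To avoid that error in a script, one option is to check which environment is active before removing anything. The sketch below assumes the shell was set up with conda activate, which exports the name of the active environment in the CONDA_DEFAULT_ENV variable. The function can_remove_env is a hypothetical helper, and refusing to touch base is my own precaution rather than something the workshop requires.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the named environment looks safe to remove.
can_remove_env() {
  local target="$1"
  if [ "$target" = "base" ]; then
    echo "refusing: base is the default environment" >&2
    return 1
  fi
  # CONDA_DEFAULT_ENV holds the active environment name; it may be unset outside Conda.
  if [ "$target" = "${CONDA_DEFAULT_ENV:-}" ]; then
    echo "refusing: $target is active; run 'conda deactivate' first" >&2
    return 1
  fi
  return 0
}

# Example (left commented out so that nothing is deleted by accident):
# can_remove_env bwa_old && conda env remove -n bwa_old
```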
The commands below will list all the packages in the bwa_old and base environments. The list will include versions of each package, the specific build, and the channel that the package was downloaded from. conda list is useful to ensure that you have installed the packages that you desire.
conda list -n bwa_old
# packages in environment at /opt/conda/envs/bwa_old:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
bwa 0.7.15 1 bioconda
libgcc 7.2.0 h69d50b8_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
zlib 1.2.11 h7b6447c_3
conda list -n base
# packages in environment at /opt/conda:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
asn1crypto 1.2.0 py27_0
ca-certificates 2019.11.27 0
certifi 2019.11.28 py27_0
cffi 1.13.2 py27h2e261b9_0
chardet 3.0.4 py27_1003
conda 4.7.12 py27_0
conda-package-handling 1.6.0 py27h7b6447c_0
cryptography 2.8 py27h1ba5d50_0
enum34 1.1.6 py27_1
futures 3.3.0 py27_0
idna 2.8 py27_0
ipaddress 1.0.23 py_0
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses 6.1 he6710b0_1
openssl 1.1.1d h7b6447c_3
pip 19.3.1 py27_0
pycosat 0.6.3 py27h14c3975_0
pycparser 2.19 py27_0
pyopenssl 19.1.0 py27_0
pysocks 1.7.1 py27_0
python 2.7.16 h9bab390_0
readline 7.0 h7b6447c_5
requests 2.22.0 py27_0
ruamel_yaml 0.15.46 py27h14c3975_0
setuptools 42.0.2 py27_0
six 1.13.0 py27_0
sqlite 3.30.1 h7b6447c_0
tk 8.6.8 hbc83047_0
tqdm 4.40.0 py_0
urllib3 1.24.2 py27_0
wheel 0.33.6 py27_0
yaml 0.1.7 had09818_2
zlib 1.2.11 h7b6447c_3
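The name and version columns of these listings are what you would pin in an environment file. As a small illustration of the column layout, the sketch below turns conda list output into name=version lines. The function list_to_pins is a hypothetical helper, and the input is a few lines copied from the bwa_old listing above. Conda has its own export commands (conda list --export and conda env export), which are the better choice in practice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert `conda list` output on stdin into name=version lines.
list_to_pins() {
  # Skip the "#" comment and header lines; column 1 is the name, column 2 the version.
  awk '!/^#/ && NF >= 2 { print $1 "=" $2 }'
}

# A few lines copied from the bwa_old listing above.
list_to_pins <<'EOF'
# packages in environment at /opt/conda/envs/bwa_old:
#
# Name Version Build Channel
bwa 0.7.15 1 bioconda
zlib 1.2.11 h7b6447c_3
EOF
```

With Conda available, conda list -n bwa_old | list_to_pins would produce the same kind of output for the whole environment.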
Conda packages are files containing a bundle of resources: usually libraries and executables, but not always. In principle, Conda packages can include data, images, notebooks, or other assets.
It is important to be careful when downloading packages and to use only trusted sources. conda-forge is a reliable source for many popular Python packages. Anaconda Cloud is a package management service that makes it easy to find, access, store and share public and private notebooks, environments, and Conda and PyPI packages. Bioconda is a trusted channel for the Conda package manager specialising in bioinformatics software.
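If you have a few trusted channels that you prefer to use, you can pre-configure them so that every time you create an environment, you won’t need to explicitly declare the channel. Note that conda config --add puts the new channel at the top of the channel list, so the channel added last gets the highest priority.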
conda config --add channels conda-forge
conda config --add channels bioconda
After adding channels, you can search for packages.
conda search bwa
Loading channels: done
# Name Version Build Channel
bwa 0.5.9 0 bioconda
bwa 0.5.9 1 bioconda
bwa 0.5.9 ha92aebf_2 bioconda
bwa 0.6.2 0 bioconda
bwa 0.6.2 1 bioconda
bwa 0.6.2 ha92aebf_2 bioconda
bwa 0.7.3a 0 bioconda
bwa 0.7.3a 1 bioconda
bwa 0.7.3a h84994c4_3 bioconda
bwa 0.7.3a h84994c4_4 bioconda
bwa 0.7.3a ha92aebf_2 bioconda
bwa 0.7.4 h84994c4_1 bioconda
bwa 0.7.4 ha92aebf_0 bioconda
bwa 0.7.4 hed695b0_2 bioconda
bwa 0.7.4 hed695b0_3 bioconda
bwa 0.7.8 0 bioconda
bwa 0.7.8 1 bioconda
bwa 0.7.8 h84994c4_3 bioconda
bwa 0.7.8 ha92aebf_2 bioconda
bwa 0.7.8 hed695b0_4 bioconda
bwa 0.7.8 hed695b0_5 bioconda
bwa 0.7.12 0 bioconda
bwa 0.7.12 1 bioconda
bwa 0.7.13 0 bioconda
bwa 0.7.13 1 bioconda
bwa 0.7.15 0 bioconda
bwa 0.7.15 1 bioconda
bwa 0.7.16 pl5.22.0_0 bioconda
bwa 0.7.17 h84994c4_4 bioconda
bwa 0.7.17 h84994c4_5 bioconda
bwa 0.7.17 ha92aebf_3 bioconda
bwa 0.7.17 hed695b0_6 bioconda
bwa 0.7.17 pl5.22.0_0 bioconda
bwa 0.7.17 pl5.22.0_1 bioconda
bwa 0.7.17 pl5.22.0_2 bioconda
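The search output has one line per build, so the same version appears several times. As a rough sketch, the function below reduces conda search output to the distinct versions, sorted from oldest to newest. search_versions is a hypothetical helper, the input is a subset of the lines shown above, and the -V (version sort) option assumes GNU sort, as found in the container.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the distinct versions in `conda search` output, oldest first.
search_versions() {
  # Drop the "Loading channels" line and the "#" header, keep the version column,
  # then sort by version number and remove duplicates.
  awk '!/^#/ && !/^Loading/ && NF >= 3 { print $2 }' | sort -u -V
}

# A subset of the search output above.
search_versions <<'EOF'
Loading channels: done
# Name Version Build Channel
bwa 0.7.8 hed695b0_5 bioconda
bwa 0.7.12 0 bioconda
bwa 0.7.12 1 bioconda
bwa 0.7.15 0 bioconda
bwa 0.7.15 1 bioconda
bwa 0.7.17 pl5.22.0_2 bioconda
EOF
```

With Conda available, conda search bwa | search_versions gives a quick overview of which versions can be pinned in an environment file.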
Use conda remove to remove packages.
conda remove bcftools