Last updated: 2020-06-21


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/queuing.Rmd) and HTML (docs/queuing.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 3b81b96 davetang 2020-06-21 Queuing systems

Introduction

If you will be using a high-performance computing (HPC) cluster for your work, you should learn to use a batch-queuing system. These systems schedule, dispatch, and manage the execution of your jobs, and they also manage resource allocation.

Queuing systems

  • Oracle Grid Engine, previously known as Sun Grid Engine
  • Univa Grid Engine is a batch-queuing system, forked from Sun Grid Engine (SGE)
  • Portable Batch System
    • OpenPBS — original open source version released by MRJ in 1998 (actively developed)
    • TORQUE — a fork of OpenPBS that is maintained by Adaptive Computing Enterprises, Inc. (formerly Cluster Resources, Inc.)
    • PBS Professional (PBS Pro) — the version of PBS offered by Altair Engineering that is dual licensed under an open source and a commercial license.
  • SLURM

See comparison of cluster software.

PBS

PBS.

Specific tasks.

Resources.

SLURM

  • A SLURM partition is a queue
  • A SLURM cluster is all the partitions that are managed by a single SLURM daemon

To submit a job script, list the queue, or cancel a job, type:

sbatch job_script.slurm
squeue
scancel jobid

To list partitions type:

sinfo

It is important to use the correct system and partition for each part of a workflow. To list the limits of each partition, use scontrol.

scontrol show partition

Use squeue to display the status of jobs in the local cluster; the larger the priority value, the higher the priority.

squeue

# queue for specific user
squeue -u dtang

# queue for specific partition and sorted by priority
squeue -p workq -S p

Individual job information.

scontrol show job jobid

SLURM needs to know two things from you:

  1. Resource requirement: how many nodes and how long
  2. What to run

Try to ask for the right amount of resources because:

  1. Over-estimating the resources will mean it will take longer to find an available slot.
  2. Under-estimating the time required means the job will get killed.
  3. Under-estimating memory will mean your job will crash.
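
One way to arrive at a time request is to time a small test run and pad it with a modest margin. The sketch below assumes a default margin of 20%, which is an illustrative choice and not a SLURM recommendation; slurm_time is a hypothetical helper name. It prints the result in the days-hours:minutes:seconds form that --time accepts.

```shell
#!/bin/bash
# Hypothetical helper: convert a measured run time in seconds, plus a safety
# margin in percent (assumed default: 20), into days-hours:minutes:seconds.
slurm_time() {
  local seconds=$1
  local margin=${2:-20}
  # Add the margin, rounding up to a whole second.
  local total=$(( (seconds * (100 + margin) + 99) / 100 ))
  printf '%d-%02d:%02d:%02d\n' \
    $(( total / 86400 )) \
    $(( total % 86400 / 3600 )) \
    $(( total % 3600 / 60 )) \
    $(( total % 60 ))
}

# A test run that took 50 minutes, padded by the default 20%.
slurm_time 3000   # prints 0-01:00:00
```

The printed value can be pasted into a directive such as #SBATCH --time=0-01:00:00.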

You cannot submit an application directly to SLURM; instead, SLURM executes a list of shell commands on your behalf. In batch mode, SLURM executes a job script that contains the commands, written as a bash or csh script. In interactive mode, you type the commands just as you would after logging in.

sbatch interprets directives in the script. Because they are written as comments, the shell does not execute them.

  • Directive lines start with #SBATCH
  • These are equivalent to sbatch command-line arguments
  • Directives are usually more convenient and reproducible than command-line arguments
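
To illustrate the equivalence between directives and command-line arguments, the sketch below strips the #SBATCH prefix from each directive line, leaving the options as they would be typed after sbatch. The helper list_sbatch_directives is hypothetical and is only a quick inspection aid; it does not reproduce how sbatch itself parses a script.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): print the options carried by
# "#SBATCH" directive lines, one per line.
list_sbatch_directives() {
  # Keep only lines that begin with "#SBATCH" and strip that prefix.
  sed -n 's/^#SBATCH[[:space:]]\{1,\}//p' "$1"
}

# Build a small throwaway job script to inspect.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash -l
#SBATCH --partition=workq
#SBATCH --time=00:05:00

hostname
EOF

# Prints:
# --partition=workq
# --time=00:05:00
list_sbatch_directives "$demo_script"
rm -f "$demo_script"
```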

Below is an example script.

#!/bin/bash -l
#SBATCH --partition=workq
#SBATCH --job-name=hostname
#SBATCH --account=director2120
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE

hostname

Use --export=NONE to start with a clean environment, which improves reproducibility and keeps variables from the submitting shell out of the job.

Use sbatch to submit the job.

sbatch hostname.slurm
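
On success, sbatch prints a line of the form "Submitted batch job 123456", and that job ID is what scancel and scontrol show job expect. The sketch below shows one way to capture it; job_id_from_sbatch is a hypothetical helper name, and the submission is skipped when sbatch or the script is not available.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from the line sbatch prints on
# success, e.g. "Submitted batch job 123456".
job_id_from_sbatch() {
  local line=$1
  # The job ID is the last space-separated field.
  echo "${line##* }"
}

# Only talk to the scheduler when sbatch and the job script are present.
if command -v sbatch >/dev/null 2>&1 && [ -f hostname.slurm ]; then
  jobid=$(job_id_from_sbatch "$(sbatch hostname.slurm)")
  scontrol show job "$jobid"
fi
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which may be simpler when it fits your setup.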

Parallel applications are launched using srun.

Use salloc instead of sbatch for interactive jobs. Use -p to request a specific partition for the resource allocation. If no partition is specified, the SLURM controller selects the default partition designated by the system administrator.

salloc --ntasks=16 --time=00:10:00
srun make -j 16

When specifying the number of threads, make sure you know which parallel programming model your library or software uses. How you specify the number of tasks may affect how your program runs. The arguments to pay attention to are:

--ntasks=# : Number of "tasks" (use with distributed parallelism).
--ntasks-per-node=# : Number of "tasks" per node (use with distributed parallelism).
--cpus-per-task=# : Number of CPUs allocated to each task (use with shared memory parallelism).

Therefore, using --cpus-per-task ensures that all CPUs for a task are allocated on the same node, whereas using --ntasks may spread the tasks across multiple nodes. You may get by with simply specifying --ntasks, but you should do some testing with a smaller dataset first.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --partition=workq
#SBATCH --ntasks=16
#SBATCH --export=NONE
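
For shared-memory programs, the job script still has to tell the program how many threads to use. A minimal sketch, assuming the job was submitted with --cpus-per-task (in which case SLURM sets SLURM_CPUS_PER_TASK) and that the program reads OMP_NUM_THREADS, which you should check against its documentation; thread_count is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: report the CPUs allocated to each task, falling back
# to 1 when SLURM_CPUS_PER_TASK is unset (e.g. on a login node).
thread_count() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads=$(thread_count)

# Assumption: the program honours OMP_NUM_THREADS. Programs with their own
# thread option would take "$threads" as an argument instead.
export OMP_NUM_THREADS=$threads

echo "Running with $threads thread(s)"
```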

Use job arrays to run embarrassingly parallel jobs. In the example below, we request that each array task be allocated 1 CPU (--ntasks=1) and 4 GB of memory (--mem=4G) for up to one hour (--time=01:00:00).

#!/bin/bash -l
#SBATCH --job-name=array
#SBATCH --partition=workq
#SBATCH --account=director2120
#SBATCH --array=0-3
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --export=NONE

FILES=(1.bam 2.bam 3.bam 4.bam)

echo "${FILES[$SLURM_ARRAY_TASK_ID]}"

Use bash arrays to store chromosomes, parameters, etc. for job arrays.
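
As a sketch of the same idea with chromosomes, the script body below maps the array task ID to one entry of a bash array and fails early if the index has no matching entry. The chromosome names and the chrom_for_task helper are hypothetical, and the array is sized to match --array=0-3.

```shell
#!/bin/bash
# Hypothetical inputs, one per array task (matches --array=0-3).
CHROMS=(chr1 chr2 chr3 chrX)

# Print the chromosome for a given array index, or fail if out of range.
chrom_for_task() {
  local idx=$1
  if (( idx < 0 || idx >= ${#CHROMS[@]} )); then
    echo "array index $idx has no matching chromosome" >&2
    return 1
  fi
  echo "${CHROMS[$idx]}"
}

# Outside SLURM the variable is unset, so default to 0 for local testing.
chrom_for_task "${SLURM_ARRAY_TASK_ID:-0}"
```

The range check guards against an --array range that is longer than the bash array, which would otherwise pass an empty value to the downstream command.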