Last updated: 2020-08-15
The results in this page were generated with repository version 93ad1d0.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/queuing.Rmd) and HTML (docs/queuing.html) files.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 93ad1d0 | davetang | 2020-08-15 | Update |
html | c6a497c | davetang | 2020-06-21 | Build site. |
Rmd | 3b81b96 | davetang | 2020-06-21 | Queuing systems |
If you will be using a high-performance computing (HPC) cluster for your work, you should learn to use a batch-queuing system. These systems schedule, dispatch, and manage the execution of your jobs, and they also manage resource allocation.
See comparison of cluster software.
You can configure the server by setting server attributes via the qmgr command:
Qmgr: set server <attribute> = <value>
The default configuration is shown below.
qmgr
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq enabled = True
set queue workq started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999
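Any attribute printed above can also be changed without entering the interactive Qmgr prompt. The sketch below wraps qmgr -c, which runs a single directive and exits. The helper name set_server_attr is hypothetical, and the example value of 300 is only an illustration; changing server attributes normally requires PBS manager privileges.

```shell
#!/bin/bash
# Hypothetical helper: set one PBS server attribute non-interactively.
# qmgr -c runs a single directive and then exits.
set_server_attr() {
  local attr=$1
  local value=$2
  qmgr -c "set server ${attr} = ${value}"
}

# Example (assumed value): run a scheduling cycle every 300 seconds
# instead of the 600 seconds shown in the default configuration.
# set_server_attr scheduler_iteration 300
```

Running qmgr -c "print server" afterwards is one way to confirm that the change was applied.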
PBS.
Specific tasks.
Resources.
sbatch job_script.slurm
squeue
scancel jobid
To list partitions type:
sinfo
It is important to use the correct system and partition for each part of a workflow. To list the limits of each partition, use scontrol.
scontrol show partition
Use squeue to display the status of jobs in the local cluster; the larger the priority value, the higher the priority.
squeue
# queue for specific user
squeue -u dtang
# queue for specific partition and sorted by priority
squeue -p workq -S p
Individual job information.
scontrol show job jobid
SLURM needs to know two things from you:
Try to ask for the right amount of resources because:
You cannot submit an application directly to SLURM; instead, SLURM executes a list of shell commands on your behalf. In batch mode, SLURM executes a job script that contains the commands as a bash or csh script. In interactive mode, you type in the commands just as you would after logging in.
sbatch interprets directives in the script, which are written as comments (lines starting with #SBATCH) and are not executed by the shell. The directives are equivalent to sbatch command-line arguments.
Below is an example script.
#!/bin/bash -l
#SBATCH --partition=workq
#SBATCH --job-name=hostname
#SBATCH --account=director2120
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE
hostname
Use --export=NONE to start with a clean environment, which improves reproducibility and avoids contaminating the job with variables from your login session.
Use sbatch to submit the job.
sbatch hostname.slurm
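When a submission is part of a larger script, it can help to capture the job ID so that later commands such as scontrol show job or scancel can refer to it. The sketch below relies on the --parsable option of sbatch, which prints only the job ID (optionally followed by a semicolon and the cluster name). The function name submit_job is a hypothetical helper, not part of SLURM.

```shell
#!/bin/bash
# Hypothetical helper: submit a job script and print only the job ID.
# With --parsable, sbatch prints "jobid" or "jobid;cluster" instead of
# the usual "Submitted batch job <jobid>" message.
submit_job() {
  local script=$1
  local out
  out=$(sbatch --parsable "$script") || return 1
  # Drop the optional ";cluster" suffix.
  echo "${out%%;*}"
}

# Example usage with the script from above:
# jobid=$(submit_job hostname.slurm)
# scontrol show job "$jobid"
```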
Parallel applications are launched using srun.
Use salloc instead of sbatch for interactive jobs. Use -p to request a specific partition for the resource allocation. If it is not specified, the SLURM controller selects the default partition designated by the system administrator.
salloc --tasks=16 --time=00:10:00
srun make -j 16
When specifying the number of threads, make sure you know the parallel programming model that is used by your library or software. The manner in which you issue the number of tasks may affect how your program runs. The arguments to pay attention to are:
--ntasks=# : Number of "tasks" (use with distributed parallelism).
--ntasks-per-node=# : Number of "tasks" per node (use with distributed parallelism).
--cpus-per-task=# : Number of CPUs allocated to each task (use with shared memory parallelism).
Therefore, using --cpus-per-task ensures that the CPUs for a task are allocated on the same node, while using --ntasks may spread the tasks across multiple nodes. You may get by with simply specifying --ntasks, but you should first test with a smaller dataset.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --partition=workq
#SBATCH --ntasks=16
#SBATCH --export=NONE
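For a multithreaded (shared-memory) tool, a variant of the header above might request one task with several CPUs instead. The sketch below assumes 16 threads and reads the count back from SLURM_CPUS_PER_TASK, which SLURM sets only when --cpus-per-task is given. The echo line is a placeholder for your own command.

```shell
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
#SBATCH --partition=workq
#SBATCH --export=NONE

# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested;
# fall back to 1 so the script also runs outside the scheduler.
THREADS=${SLURM_CPUS_PER_TASK:-1}

# Placeholder: pass ${THREADS} to the thread option of your own tool.
echo "running with ${THREADS} threads"
```

Reading the thread count from the environment keeps the resource request and the command in step, so changing --cpus-per-task is the only edit needed.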
Use job arrays to run embarrassingly parallel jobs. In the example below, we request that each array task be allocated 1 CPU (--ntasks=1) and 4 GB of memory (--mem=4G) for up to one hour (--time=01:00:00).
#!/bin/bash -l
#SBATCH --job-name=array
#SBATCH --partition=workq
#SBATCH --account=director2120
#SBATCH --array=0-3
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --export=NONE
FILES=(1.bam 2.bam 3.bam 4.bam)
echo "${FILES[$SLURM_ARRAY_TASK_ID]}"
Use bash arrays to store chromosomes, parameters, etc. for job arrays.
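As a rough sketch of that idea, the snippet below maps an array task ID to a chromosome name and checks that the ID is in range. The chromosome list and the helper names chrom_for_task and array_range are hypothetical; only SLURM_ARRAY_TASK_ID comes from SLURM itself.

```shell
#!/bin/bash
# Hypothetical chromosome list; replace with the sequence names you need.
CHROMS=(chr1 chr2 chr3 chrX)

# Print the chromosome for a given array task ID, or fail if the ID
# does not index into CHROMS.
chrom_for_task() {
  local id=$1
  if [[ ! $id =~ ^[0-9]+$ ]] || (( id >= ${#CHROMS[@]} )); then
    echo "task ID '${id}' is out of range" >&2
    return 1
  fi
  echo "${CHROMS[$id]}"
}

# Print the --array range that matches the length of CHROMS.
array_range() {
  echo "0-$(( ${#CHROMS[@]} - 1 ))"
}

# Inside an array job SLURM sets SLURM_ARRAY_TASK_ID; default to 0 so the
# script can also be tried outside the scheduler.
CHROM=$(chrom_for_task "${SLURM_ARRAY_TASK_ID:-0}") || exit 1
echo "processing ${CHROM}"
```

Keeping the list and the range in one place means the --array directive can be checked against array_range whenever the list changes.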