Write a script to easily redo a task

Last updated: 2023-07-11

Checks: 7 passed, 0 failed

Knit directory: bioinformatics_tips/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20200503)

The command set.seed(20200503) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 6072e0e

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 6072e0e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  analysis/compsci.Rmd
    Untracked:  analysis/framework.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/scripting.Rmd) and HTML (docs/scripting.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	6072e0e	Dave Tang	2023-07-11	Seven quick tips for analysis scripts in neuroimaging
html	55bfca0	Dave Tang	2023-06-26	Build site.
Rmd	05556ea	Dave Tang	2023-06-26	Further reading
html	93a67d7	Dave Tang	2023-01-12	Build site.
Rmd	82a105c	Dave Tang	2023-01-12	Scripting
html	78910db	Dave Tang	2022-04-13	Build site.
Rmd	f80e28b	Dave Tang	2022-04-13	Script as much as possible

It may already be obvious that writing a script, i.e. a file that runs a series of commands, is far preferable to typing those commands manually. For example, instead of manually typing the commands for the various tools that process your data, you save all those commands in a single file, called a script, and execute that file to run all the steps. Now you can easily re-run your analysis.

Having all your commands saved in a script also makes it easier to re-run your analysis with new data or new settings. You could edit your script manually to specify the location of the new data, but what you should do is write a script that accepts command line arguments. This means that instead of hardcoding a value such as /data/gene_exp.csv in your script, you pass it to your script as an argument/parameter. If your script is called summarise.sh, you could write it so that the data is passed via the command line.

./summarise.sh /data/gene_exp.csv

You could also include settings/parameters that can be passed to your script, so you can easily re-run your analysis with different settings.

./summarise.sh /data/gene_exp.csv --alpha 0.5 --beta 3
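As a rough illustration of how a script like summarise.sh might accept its input file and settings, here is a minimal sketch. The default values for alpha and beta, the usage message, and the row-counting "analysis" are all assumptions made for the example and are not part of the original script.

```bash
#!/usr/bin/env bash
# summarise.sh: hypothetical sketch of a script that takes its input file and
# settings from the command line instead of hardcoding them.

usage() {
  echo "Usage: summarise.sh <input.csv> [--alpha N] [--beta N]" >&2
}

summarise() {
  # Default settings (assumed values, for illustration only).
  local input="" alpha=0.05 beta=1 rows

  # Walk through the arguments: options take a value, anything else is the input file.
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --alpha)
        [ "$#" -ge 2 ] || { usage; return 1; }
        alpha="$2"; shift 2 ;;
      --beta)
        [ "$#" -ge 2 ] || { usage; return 1; }
        beta="$2"; shift 2 ;;
      -h|--help)
        usage; return 0 ;;
      -*)
        echo "Unknown option: $1" >&2; usage; return 1 ;;
      *)
        input="$1"; shift ;;
    esac
  done

  # Fail early with a clear message if the input is missing or unreadable.
  if [ -z "$input" ]; then usage; return 1; fi
  if [ ! -r "$input" ]; then echo "Cannot read: $input" >&2; return 1; fi

  # Placeholder analysis: count the data rows, skipping the header line.
  rows=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$input")
  echo "input=$input alpha=$alpha beta=$beta rows=$rows"
}

# Only run when arguments are supplied, so the function can also be sourced.
if [ "$#" -gt 0 ]; then
  summarise "$@"
fi
```

Because nothing is hardcoded, the same file can be pointed at a different dataset or re-run with different settings without being edited.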

After you have nicely scripted up this analysis, you start working on scripting up another analysis. However, you realise that some steps in your previous analysis are also needed in this analysis. You could use your previous script as a template and modify it for this analysis. But what you should do is put each step in its own separate script that accepts command line arguments. This way you don’t have to modify two analysis scripts when you need to change an individual step. It may seem annoying to have to write ten separate scripts for an analysis pipeline that has ten steps. But this makes the pipeline much easier to maintain in the future, especially when you start building more and more analysis pipelines.
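One way to combine separate step scripts is a small driver that runs them in order. The sketch below assumes a hypothetical convention in which every step script is called as `step.sh <input> <output>`; the function name run_steps and the step script names in the usage line are made up for illustration.

```bash
#!/usr/bin/env bash
# run_steps: run each step script in order, passing the output file of one
# step as the input file of the next.
# Assumed (hypothetical) interface for every step: step.sh <input> <output>

run_steps() {
  if [ "$#" -lt 3 ]; then
    echo "Usage: run_steps <input> <outdir> <step.sh> [<step.sh> ...]" >&2
    return 1
  fi

  local input="$1" outdir="$2"
  shift 2

  local current="$input" step out n=0
  mkdir -p "$outdir" || return 1

  for step in "$@"; do
    n=$((n + 1))
    out="$outdir/step${n}.out"
    # Stop at the first failing step so later steps never see bad input.
    "$step" "$current" "$out" || { echo "Step failed: $step" >&2; return 1; }
    current="$out"
  done

  # Print the path of the final result.
  echo "$current"
}

# Only run when arguments are supplied, so the function can also be sourced.
if [ "$#" -gt 0 ]; then
  run_steps "$@"
fi
```

A second pipeline could then reuse the same steps in a different combination, for example `run_steps /data/gene_exp.csv results ./filter.sh ./summarise.sh`, where filter.sh is a hypothetical step script.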

If you have gone this far to set up your work, you can go a bit further and tie everything together using a workflow management system. The benefit of such systems is that they make it easier to manage your workflows. For example, you could execute your workflow via a queuing system or Google Cloud. You could set limits for computational resources, restart jobs, cache results, and more.

Now that everything is nicely automated, it is time to include tests, which make sure your analysis pipeline generates the expected results. This should also be automated using CI/CD (continuous integration/continuous delivery), which means that each time you make a change to your pipeline, another pipeline is automatically run to check that your pipeline is running as expected.
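A test for a pipeline step can be as simple as running it on a small, fixed input and comparing the result with a stored expected output. The sketch below shows one possible shape for such a check; the function name check_output and the file names in the usage line are hypothetical.

```bash
#!/usr/bin/env bash
# check_output: run a command and compare what it prints with a file holding
# the expected result. Returns 0 on a match and 1 otherwise.
# Usage (hypothetical): check_output <expected_file> <command> [args ...]

check_output() {
  if [ "$#" -lt 2 ]; then
    echo "Usage: check_output <expected_file> <command> [args ...]" >&2
    return 1
  fi

  local expected="$1"
  shift

  local actual status=0
  actual=$(mktemp) || return 1

  if ! "$@" > "$actual"; then
    # The command itself failed, so there is nothing meaningful to compare.
    echo "FAIL: command exited with an error: $*"
    status=1
  elif diff -u "$expected" "$actual"; then
    echo "PASS: $*"
  else
    # diff has already printed the differences above.
    echo "FAIL: output differs from $expected"
    status=1
  fi

  rm -f "$actual"
  return "$status"
}

# Only run when arguments are supplied, so the function can also be sourced.
if [ "$#" -gt 0 ]; then
  check_output "$@"
fi
```

A call such as `check_output tests/expected.txt ./summarise.sh tests/small.csv` exits with a non-zero status when the output changes, and that exit status is what a CI service typically uses to mark a run as failed.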

Since everything is so nicely set up, you have more time to do what needs to be done! And it wouldn’t have been possible if you didn’t script everything up.

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.0

loaded via a namespace (and not attached):
 [1] vctrs_0.6.3      httr_1.4.5       cli_3.6.1        knitr_1.42      
 [5] rlang_1.1.1      xfun_0.39        stringi_1.7.12   processx_3.8.1  
 [9] promises_1.2.0.1 jsonlite_1.8.4   glue_1.6.2       rprojroot_2.0.3 
[13] git2r_0.32.0     htmltools_0.5.5  httpuv_1.6.9     ps_1.7.5        
[17] sass_0.4.5       fansi_1.0.4      rmarkdown_2.21   jquerylib_0.1.4 
[21] tibble_3.2.1     evaluate_0.20    fastmap_1.1.1    yaml_2.3.7      
[25] lifecycle_1.0.3  whisker_0.4.1    stringr_1.5.0    compiler_4.3.0  
[29] fs_1.6.2         pkgconfig_2.0.3  Rcpp_1.0.10      rstudioapi_0.14 
[33] later_1.3.0      digest_0.6.31    R6_2.5.1         utf8_1.2.3      
[37] pillar_1.9.0     callr_3.7.3      magrittr_2.0.3   bslib_0.4.2     
[41] tools_4.3.0      cachem_1.0.7     getPass_0.2-2

Dave Tang

Further reading