Frameworks and guidelines

Last updated: 2023-11-24

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/framework.Rmd) and HTML (docs/framework.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	be5acbd	Dave Tang	2023-11-24	Frameworks

Introduction

I learned bioinformatics many years ago with pretty much no background in computer science, mathematics, or statistics. I learned some maths and stats in high school but that’s about it. Back then a lot of bioinformatics analyses were glued (or duct-taped) together using custom Perl scripts, and many of those scripts were written by people like me: people without any formal training in computer science or software engineering. In addition, there was very little material for learning bioinformatics at the time, and many of the available references were technical books written for people with some computing background. All this meant that a lot of work was performed on shaky foundations.

Bioinformatics has come a long way since those days. Initiatives such as Software Carpentry have helped many researchers develop good foundations. Bioconda lets you install thousands of software packages related to biomedical research using the conda package manager. This not only simplifies the installation process (which can sometimes be a nightmare) but also helps reproducibility. Some journals, such as PLOS Computational Biology, have a policy that code must be shared publicly, rather than accepting custom Perl scripts that are "available upon request".

It helps to first distinguish between a framework and a guideline.

Guideline:

A non-specific rule or principle that provides direction to action or behaviour.

Framework:

(figuratively, especially in computing) A basic conceptual structure.

The Cambridge dictionary definition of a framework is:

a supporting structure around which something can be built.

This is just as relevant to bioinformatics as it is to the construction of a building. In the case of bioinformatics, the “supporting structure” is basically an organised set of code or guidelines that follows a defined specification.

For example, if you are going to build an analysis pipeline, use a workflow management system like Snakemake or Nextflow. Although these are usually described as workflow systems, they are in effect frameworks for building analysis pipelines.

If you are going to analyse high-throughput sequencing data, use the infrastructure/framework developed by the Bioconductor project. Many Bioconductor packages share the same underlying data structures, such as those provided by the GenomicRanges package, so once you have pre-processed your data into a common format, you can analyse and visualise it using various Bioconductor packages.

If you are performing machine learning, use the tidymodels framework if you use R, or scikit-learn if you use Python. These packages provide the code for typical machine learning tasks, such as splitting your data, performing cross-validation, and calculating and plotting performance measures.
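As a rough illustration of what such a framework gives you, here is a minimal Python sketch using scikit-learn. The synthetic dataset, the choice of logistic regression, and all parameter values are assumptions made for this example, not a recommended analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real feature matrix (illustration only).
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Hold out a test set; stratify keeps class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Putting the scaler in a pipeline means it is re-fitted on each training fold,
# so no information leaks from the validation folds.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validation on the training data only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on all of the training data, then evaluate once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The splitting, cross-validation, and scoring are all handled by the framework, so the script only has to state what should be done.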

If what you need is not available and you have to write the code yourself, you can still adopt a best-practices framework for developing code in your language of choice. If you use R, you can follow the guidelines in the R Packages book, which you can read for free!

At every level of your analysis, you can follow or use some framework to help you develop code that is much easier to maintain and more likely to be reproducible. Even when you are writing a single script, you can follow the "Ten recommendations for creating usable bioinformatics command line software".

See also "Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software".
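To make this concrete for a single script, here is a minimal Python sketch of a few practices that are commonly recommended for command-line tools: a help option, a version option, error messages on stderr, and a non-zero exit code on failure. The tool name `count_fasta`, its version number, and its behaviour are hypothetical and only serve the illustration.

```python
import argparse
import sys

PROG = "count_fasta"  # hypothetical tool name
VERSION = "0.1.0"     # hypothetical version number


def build_parser():
    """Build the command-line interface; argparse adds -h/--help automatically."""
    parser = argparse.ArgumentParser(
        prog=PROG,
        description="Count the sequences in a FASTA file.",
    )
    parser.add_argument("fasta", help="path to a FASTA file")
    parser.add_argument(
        "-v", "--version", action="version", version=f"{PROG} {VERSION}"
    )
    return parser


def count_records(lines):
    """Count FASTA header lines, i.e. lines starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def main(argv=None):
    """Run the tool and return an exit code (0 on success, 1 on failure)."""
    args = build_parser().parse_args(argv)
    try:
        with open(args.fasta) as handle:
            n_records = count_records(handle)
    except OSError as err:
        # Errors go to stderr and are signalled with a non-zero exit code.
        print(f"{PROG}: error: {err}", file=sys.stderr)
        return 1
    # Results go to stdout so that they can be piped into other tools.
    print(n_records)
    return 0


# In a real script, finish with:
#   if __name__ == "__main__":
#       sys.exit(main())
```

Calling `main(["example.fa"])` from a Python session returns the exit code directly, which also makes the script easy to test.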

Why people don’t use frameworks

Learning curve
Cost-benefit

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.4       httr_1.4.7        cli_3.6.1         knitr_1.44       
 [5] rlang_1.1.1       xfun_0.40         stringi_1.7.12    processx_3.8.2   
 [9] promises_1.2.1    jsonlite_1.8.7    glue_1.6.2        rprojroot_2.0.3  
[13] git2r_0.32.0      htmltools_0.5.6.1 httpuv_1.6.12     ps_1.7.5         
[17] sass_0.4.7        fansi_1.0.5       rmarkdown_2.25    jquerylib_0.1.4  
[21] tibble_3.2.1      evaluate_0.22     fastmap_1.1.1     yaml_2.3.7       
[25] lifecycle_1.0.3   whisker_0.4.1     stringr_1.5.0     compiler_4.3.2   
[29] fs_1.6.3          pkgconfig_2.0.3   Rcpp_1.0.11       rstudioapi_0.15.0
[33] later_1.3.1       digest_0.6.33     R6_2.5.1          utf8_1.2.4       
[37] pillar_1.9.0      callr_3.7.3       magrittr_2.0.3    bslib_0.5.1      
[41] tools_4.3.2       cachem_1.0.8      getPass_0.2-2
