Last updated: 2021-03-03

Checks: 7 passed, 0 failed

Knit directory: fa_sim_cal/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 5b5369f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    .tresorit/
    Ignored:    data/VR_20051125.txt.xz
    Ignored:    output/blk_char.fst
    Ignored:    output/ent_blk.fst
    Ignored:    output/ent_cln.fst
    Ignored:    output/ent_raw.fst
    Ignored:    renv/library/
    Ignored:    renv/staging/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/workflow.Rmd) and HTML (docs/workflow.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 5b5369f Ross Gayler 2021-03-03 Add workflow management notes
Rmd d2d559e Ross Gayler 2021-03-02 end of day
Rmd 199d85e Ross Gayler 2021-03-02 end of day
Rmd 9ba0dc4 Ross Gayler 2021-03-01 end of day
Rmd 55ee0b1 Ross Gayler 2021-02-27 end of day

This project uses the targets and workflowr packages for managing the workflow of the project (making sure that the dependencies between computational steps are satisfied). When this work was started there were no easily found examples of using targets and workflowr together. This notebook contains notes on the proposed workflow for using targets and workflowr.

1 Assumptions

These points reflect my (possibly faulty) understanding of targets and workflowr. If I am wrong here I hope that somebody will see this and let me know, rather than me having to find out the hard way.

  • targets and workflowr both work by tracking some set of entities and the computational dependencies between them. When any of the tracked entities changes the packages calculate the minimal set of downstream dependencies that need to be recomputed to bring all the entities into a consistent state of being up to date.

  • targets

    • Supports a computational-pipeline-centric style of analysis

    • Tracks data objects (including files) and functions.

    • The focus is on the data transformations performed by the computational pipeline (rather than on human-generated text in reports).

      • Although reports can be rendered, they appear to be treated as optional final steps like generating a plot, rather than as a core concern.
    • Knows about high-performance computing and can run computations in parallel.

  • workflowr

    • Supports a notebook-centric style of analysis

    • Only tracks Rmd notebook files and the corresponding rendered output files (https://github.com/ropensci/tarchetypes/issues/23#issuecomment-749118599)

      • workflowr::wflow_build() tracks modification dates of Rmd files and the corresponding rendered output files
      • workflowr::wflow_publish() tracks git status of Rmd files and the corresponding rendered output files
      • The computational consistency aspect is really only about the consistency between the notebook Rmd files and their rendered counterparts
    • The computational reproducibility aspect is restricted to ensuring that random number seeds are set appropriately, that each notebook is executed in a clean environment, and that the package versions are recorded

    • Automatic building of a website for the rendered notebooks

    • Publication of website integrated with git

    • Automatic publication of website served by GitHub Pages

  • Comparison

    • targets provides more general and fine-grained control of the computational pipeline

      • Rendering of Rmd documents can be treated as just another step in a targets computational pipeline
      • I would prefer to have computational dependency tracking handled by one package rather than splitting responsibility across multiple packages
      • If I am going to use only one package to handle computational dependency tracking then I think it has to be targets
    • If I use targets to manage computational dependency tracking, what extra capabilities does workflowr provide?

      • Automatic generation of a website of rendered notebooks
      • Automatic publication of the website via GitHub pages (provided you use GitHub as the git remote repository).
      • I want this project to be publicly accessible and I don’t want the trouble of having to manually generate a website, so I will use workflowr to handle the building and publication of a project website.
  • Design states and design reasoning

    • The design state of the computational pipeline (loosely defined) reflects the current best beliefs. That is, any previous states were believed to be flawed in some way and the current design state is believed to be better.

    • While the system is being modified from the prior design state to the current design state it is transiently broken and there is no need for that broken state to be preserved and easily accessible later.

    • Prior design states are not interesting (because they are believed to be worse than the current design state) so there is no need for prior design states to be preserved and easily accessible later.

    • The reasoning behind the current design state is important and must be preserved and immediately accessible.

    • The reasoning behind the current design state may involve reference to prior design states and prior design reasoning. Where these references are needed they are included directly in the design reasoning for the current design state and are preserved and immediately accessible.

      • The design reasoning consists of manually constructed text, possibly supported by specific analyses of data derived from the computational pipeline.
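
To make the targets side of these assumptions concrete, the sketch below shows a minimal _targets.R. The input file name is taken from this repository, but R/functions.R and clean_entities() are hypothetical names used only for illustration; this is not the project’s actual pipeline definition.

# _targets.R (sketch): targets tracks the input file, the functions sourced
# from R/functions.R, and the derived data objects, and re-runs only what is stale.
library(targets)
source("R/functions.R")  # assumed to define clean_entities()

list(
  # The raw input file is tracked; this target is invalidated if the file changes.
  tar_target(ent_raw_file, "data/VR_20051125.txt.xz", format = "file"),
  # A derived object; invalidated if the file or clean_entities() changes.
  tar_target(ent_cln, clean_entities(ent_raw_file))
)

Running targets::tar_make() rebuilds only the targets whose upstream code or data has changed, which is the dependency-tracking behaviour assumed above.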

2 Proposed workflow

  • The proposed workflow needs to support my preferences for how to organise a project. In particular, a computational research project necessarily involves many design choices for the computational details. It is my strong preference that the reasoning behind these design choices (which may involve additional empirical work to support the reasoning) is documented as part of the project.

  • The total workflow of the project has multiple components:

    • core - This is the workflow that generates the primary computational outputs (data sets, tables, plots, etc.) of the project. None of the outputs of this workflow include manual interpretation. The transformations are purely mechanical and can be repeated automatically.
    • publications - These are documents (manuscripts, presentations, etc.) that interpret the results of the core for some audience. These should be construed as the principal outputs of the core but they are treated separately because they necessarily involve interpretation which cannot be automated. The computation which generates the publications can be automated, but it can’t automatically update the interpretations in the publications.
    • meta - These are computations and interpretations that are about the core, but not required by the core. This is where the design reasoning lives.
  • The core is implemented as a standard targets computational pipeline.

  • The publications are implemented as extensions to the core pipeline.

    • There may be computational steps to generate data objects needed specifically for publications (e.g. plots and tables).
    • It is recommended that computation in Rmd documents is minimised.
    • The publications are plain Rmd documents (not workflowr notebooks).
    • The publication-specific data objects and the rendering of the Rmd documents are managed by targets.
  • The meta components are implemented as short chains hanging off the core pipeline.

    • There may be computational steps to generate data objects needed specifically for meta documents.
    • However, there is less pressure to minimise computation in the Rmd documents.
    • The meta publications are workflowr Rmd notebooks.
    • The meta-specific data objects and the rendering of the Rmd documents are managed by targets.
    • The meta documents are rendered to a workflowr website.
    • The workflowr website also has links to the rendered publications.

An example is summarised in the following diagram.

The arrows represent data flows (dependencies). These dependencies allow targets to work out what is out of date, and therefore requires re-execution, when any of the tracked entities is modified.

The circles represent data objects (R objects and files).

The double circles represent Rmarkdown files. targets treats them like any other data object, but I have distinguished them in this diagram.

The triangles represent functions that generate or transform data.

The hexagons represent rendered Rmarkdown files.

The red nodes represent the core pipeline. Data is ingested and repeatedly transformed by functions.

The green nodes represent the publication workflow. There can be multiple publications derived from the core.

A publication may apply functions to core data to generate summaries for inclusion in the publication (e.g. plots, tables).

The text of the publication is in the plain Rmarkdown file.

The publication Rmarkdown file and the data it depends on are knitted to generate the rendered publication.

The gold nodes represent the meta publication workflow. The two dark gold nodes are a special case of the meta publication workflow.

A meta publication typically applies functions to some core data to generate summaries which inform the design reasoning for the next set of functions in the core pipeline. (Other patterns are possible, taking data from multiple core data objects, or even no data at all.)

The gold double circle nodes represent workflowr Rmarkdown files. These contain the text of the reasoning behind the design decisions.

The workflowr Rmarkdown file and the data it depends on are knitted to generate the rendered meta publication recording the reasoning behind some design decisions.

There can be many meta publications. They are the documentation of the design of the project.

The two dark gold nodes represent the website part of the meta publication workflow.

The workflowr index Rmarkdown normally contains links to all the rendered documents of interest (meta publications and external publications) and is rendered to become the home page of the project website.
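
The _targets.R sketch below shows one way the structure in the diagram could be expressed. The data object names echo this project’s output files, but all the functions and Rmarkdown file names are hypothetical illustrations rather than the project’s actual pipeline definition. tarchetypes::tar_render() renders a plain Rmarkdown file as an ordinary target, and the workflowr notebook is rendered by wrapping workflowr::wflow_build() in a target.

# _targets.R (sketch of the proposed layout; names are hypothetical)
library(targets)
library(tarchetypes)
source("R/functions.R")  # assumed to define the functions used below

list(
  ## Core pipeline (red nodes): ingest the data and transform it with tracked functions.
  tar_target(ent_raw_file, "data/VR_20051125.txt.xz", format = "file"),
  tar_target(ent_cln, clean_entities(ent_raw_file)),
  tar_target(ent_blk, block_entities(ent_cln)),

  ## Publication workflow (green nodes): publication-specific summaries plus a plain
  ## Rmarkdown file rendered by targets. The Rmd is assumed to call tar_read(pub_table),
  ## which is how tar_render() detects the dependency.
  tar_target(pub_table, summarise_for_publication(ent_blk)),
  tar_render(publication, "publications/paper.Rmd"),

  ## Meta workflow (gold nodes): design-reasoning data plus a workflowr notebook.
  tar_target(blk_char, characterise_blocks(ent_blk)),
  tar_target(meta_rmd, "analysis/blocking_design.Rmd", format = "file"),
  tar_target(
    meta_html,
    {
      blk_char  # named here so the notebook re-renders when this data changes
      workflowr::wflow_build(meta_rmd)
      "docs/blocking_design.html"  # track the rendered page as this target's file
    },
    format = "file"
  )
)

With this layout targets::tar_make() keeps the core, publication and meta outputs consistent, while publication of the website (committing the analysis/ notebooks and the rendered pages in docs/) is still handled with workflowr.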

3 Open questions

These are not immediately relevant to how I think the current project will pan out, but I could easily imagine them being relevant in other projects.

These questions arise because of my earlier assumption that targets and workflowr are focused on creating a current status of the project that is computationally consistent (up to date). Consequently, I have assumed that the current status of the project is directly accessible, but prior consistent states of the project are not directly (and therefore not easily) accessible from within the project. The questions below relate to use-cases where we would want current and prior states of the project to be simultaneously and directly available.

These questions are based on the assumption that it’s advantageous to have the entire computational process managed by targets or workflowr to ensure that everything is in a consistent state. That is, I am trying to avoid having any computational processes that are not managed by targets or workflowr.

I suspect that tarchetypes and branching may be relevant to these questions.

3.1 Accumulating historical/indexed states

Imagine a computational pipeline that ingests some data and reports on it. The current output reports reflect the current input data.

Now imagine that the input data is regularly updated. Whenever the input data is updated the output reports would also be updated so that the previous output no longer exists in the pipeline environment. However, in that use-case it is generally required that all the generated reports continue to be available.

The results could be accumulated outside the computational pipeline but that would appear to mean that part of the computational process is not visible to and managed by targets or workflowr.

  • So, is there a reasonable way in targets or workflowr to accumulate an arbitrary number of analyses/results from the same pipeline?

  • As a related question, would that support regenerating all the reports if the pipeline functions were updated (e.g. a bug was fixed)?

  • Related to the previous question: Would it be possible to accumulate reports corresponding to different definitions of the pipeline functions (e.g. applying different modelling techniques to the data)?

The last question makes clear that referring to “historical” is somewhat misleading. It would more generally be thought of as (potentially multidimensional) indexing across data sets and pipeline definitions.

This might be easier to do if it was conceptualised as a point-in-time computation with the indexing variable(s) used as grouping variable(s) in one input data set. However, that point-in-time view would require recomputing all the outputs whenever the input is updated, even though recomputing the previous outputs is unnecessary.
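
One possible shape for this, sketched under the assumption that static branching in tarchetypes is acceptable: tar_map() creates one set of result targets per snapshot, and tar_render_rep() renders a parameterised report per snapshot to its own output file, so earlier reports remain available. The snapshot names, input file pattern, analyse() function and reports/snapshot_report.Rmd are all hypothetical.

# _targets.R fragment (sketch): accumulate per-snapshot results and reports.
library(targets)
library(tarchetypes)

snapshots <- c("2021_01", "2021_02", "2021_03")

list(
  # Static branching: one raw-file target and one results target per snapshot.
  # Adding a new snapshot adds new targets without touching the existing ones;
  # fixing a bug in analyse() invalidates and re-runs every snapshot's results.
  tar_map(
    values = list(snapshot = snapshots),
    tar_target(raw_file,
               file.path("data", paste0("in_", snapshot, ".csv")),
               format = "file"),
    tar_target(results, analyse(raw_file))
  ),
  # One rendered report per snapshot, each written to its own output file so
  # that previously generated reports stay available.
  tar_render_rep(
    snapshot_report,
    "reports/snapshot_report.Rmd",
    params = data.frame(
      snapshot    = snapshots,
      output_file = paste0("snapshot_report_", snapshots, ".html")
    )
  )
)

Different pipeline definitions (e.g. alternative modelling functions) could be indexed the same way by adding another column to the values passed to tar_map().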

3.2 Comparing historical/indexed states

The previous questions dealt with the case where the outputs are just accumulated. Now consider the case where outputs for different index values are combined computationally inside the pipeline managed by targets or workflowr. This might be used to look at how output values change over time or over changes in the pipeline definition.

  • Is there a reasonable way in targets or workflowr to accumulate an arbitrary number of analyses/results, indexed by data sets and/or pipeline definitions, from the same pipeline such that indexed results can be computationally combined in later steps of the same pipeline?
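
A sketch of one possible answer, again assuming static branching: tar_combine() splices the per-snapshot results into a single object that later targets can use for cross-snapshot comparison. analyse_snapshot(), compare_snapshots() and the snapshot values are hypothetical, and dplyr is assumed to be available for the combining step.

# _targets.R fragment (sketch): combine indexed results inside the pipeline.
library(targets)
library(tarchetypes)

mapped <- tar_map(
  values = list(snapshot = c("2021_01", "2021_02", "2021_03")),
  unlist = FALSE,  # keep the nested structure so mapped$results can be referenced
  tar_target(results, analyse_snapshot(snapshot))
)

list(
  mapped,
  # Splice every per-snapshot results target into one combined data frame.
  tar_combine(
    results_all,
    mapped$results,
    command = dplyr::bind_rows(!!!.x, .id = "snapshot")
  ),
  # Later steps can then compare results across snapshots (or across pipeline
  # definitions, if those are added as another tar_map() values column).
  tar_target(snapshot_comparison, compare_snapshots(results_all))
)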

4 Useful links


R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] DiagrammeR_1.0.6.1 here_1.0.1         workflowr_1.6.2   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         pillar_1.5.0       compiler_4.0.3     later_1.1.0.1     
 [5] RColorBrewer_1.1-2 git2r_0.28.0       tools_4.0.3        digest_0.6.27     
 [9] jsonlite_1.7.2     evaluate_0.14      lifecycle_1.0.0    tibble_3.1.0      
[13] pkgconfig_2.0.3    rlang_0.4.10       rstudioapi_0.13    yaml_2.2.1        
[17] xfun_0.21          stringr_1.4.0      knitr_1.31         fs_1.5.0          
[21] vctrs_0.3.6        htmlwidgets_1.5.3  rprojroot_2.0.2    glue_1.4.2        
[25] R6_2.5.0           fansi_0.4.2        rmarkdown_2.7      bookdown_0.21     
[29] magrittr_2.0.1     whisker_0.4        promises_1.2.0.1   ellipsis_0.3.1    
[33] htmltools_0.5.1.1  renv_0.13.0        httpuv_1.5.5       utf8_1.1.4        
[37] stringi_1.5.3      visNetwork_2.0.9   crayon_1.4.1