Last updated: 2021-03-03

This project uses the targets and workflowr packages for managing the workflow of the project (making sure that the dependencies between computational steps are satisfied). When this work was started there were no easily found examples of using targets and workflowr together. This notebook contains notes on the proposed workflow for using targets and workflowr.

1 Assumptions

These points reflect my (possibly faulty) understanding of targets and workflowr. If I am wrong here I hope that somebody will see this and let me know, rather than me having to find out the hard way.

  • targets and workflowr both work by tracking some set of entities and the computational dependencies between them. When any tracked entity changes, the packages calculate the minimal set of downstream entities that need to be recomputed to bring everything back to a consistent, up-to-date state.

  • targets

    • Supports a computational-pipeline-centric style of analysis

    • Tracks data objects (including files) and functions.

    • The focus is on the data transformation by the computational pipeline (rather than human generated text in reports).

      • Although reports can be rendered, they appear to be treated as optional final steps like generating a plot, rather than as a core concern.
    • Knows about high-performance computing and can run computations in parallel.

  • workflowr

    • Supports a notebook-centric style of analysis

    • Only tracks Rmd notebook files and the corresponding rendered output files

      • workflowr::wflow_build() tracks modification dates of Rmd files and the corresponding rendered output files
      • workflowr::wflow_publish() tracks git status of Rmd files and the corresponding rendered output files
      • The computational consistency aspect is really only about the consistency between the notebook Rmd files and their rendered counterparts
    • The computational reproducibility aspect is restricted to ensuring that random number seeds are set appropriately, that each notebook is executed in a clean environment, and that the package versions are recorded

    • Automatic building of a website for the rendered notebooks

    • Publication of website integrated with git

    • Automatic publication of website served by GitHub Pages

  • Comparison

    • targets provides more general and fine-grained control of the computational pipeline

      • Rendering of Rmd documents can be treated as just another step in a targets computational pipeline
      • I would prefer to have computational dependency tracking handled by one package rather than splitting responsibility across multiple packages
      • If I am going to use only one package to handle computational dependency tracking then I think it has to be targets
    • If I use targets to manage computational dependency tracking, what extra capabilities does workflowr provide?

      • Automatic generation of a website of rendered notebooks
      • Automatic publication of the website via GitHub Pages (provided you use GitHub as the git remote repository).
      • I want this project to be publicly accessible and I don’t want the trouble of having to manually generate a website, so I will use workflowr to handle the building and publication of a project website.
  • Design states and design reasoning

    • The design state of the computational pipeline (loosely defined) reflects the current best beliefs. That is, any previous states were believed to be flawed in some way and the current design state is believed to be better.

    • While the system is being modified from the prior design state to the current design state it is transiently broken and there is no need for that broken state to be preserved and easily accessible later.

    • Prior design states are not interesting (because they are believed to be worse than the current design state) so there is no need for prior design states to be preserved and easily accessible later.

    • The reasoning behind the current design state is important and must be preserved and immediately accessible.

    • The reasoning behind the current design state may involve reference to prior design states and prior design reasoning. Where these references are needed they are included directly in the design reasoning for the current design state and are preserved and immediately accessible.

      • The design reasoning consists of manually constructed text, possibly supported by specific analyses of data derived from the computational pipeline.
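The dependency-tracking idea in the first assumption above can be illustrated with a few lines of base R. This is only a toy sketch of the general principle, not how targets or workflowr is actually implemented. The graph, the entity names, and the `outdated()` function are all hypothetical.

```r
# Hypothetical dependency graph: each entity maps to the entities it depends on.
deps <- list(
  raw_data   = character(0),
  clean_data = "raw_data",
  model      = "clean_data",
  plot       = "clean_data",
  report     = c("model", "plot"),
  unrelated  = character(0)
)

# Return the changed entity plus everything downstream of it,
# i.e. the set that must be recomputed to restore consistency.
outdated <- function(deps, changed) {
  stale <- changed
  repeat {
    # Entities with at least one stale dependency are themselves stale.
    hits <- names(deps)[vapply(deps, function(d) any(d %in% stale), logical(1))]
    new <- setdiff(hits, stale)
    if (length(new) == 0) break
    stale <- c(stale, new)
  }
  stale
}

outdated(deps, "clean_data")
```

In this toy graph a change to `clean_data` marks `model`, `plot`, and `report` for recomputation, while `raw_data` and `unrelated` are left alone.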

2 Proposed workflow

  • The proposed workflow needs to support my preferences for how to organise a project. In particular, a computational research project necessarily involves many design choices for the computational details. It is my strong preference that the reasoning behind these design choices (which may involve additional empirical work to support the reasoning) is documented as part of the project.

  • The total workflow of the project has multiple components:

    • core - This is the workflow that generates the primary computational outputs (data sets, tables, plots, etc.) of the project. None of the outputs of this workflow include manual interpretation. The transformations are purely mechanical and can be repeated automatically.
    • publications - These are documents (manuscripts, presentations, etc.) that interpret the results of the core for some audience. These should be construed as the principal outputs of the core but they are treated separately because they necessarily involve interpretation which cannot be automated. The computation which generates the publications can be automated, but it can’t automatically update the interpretations in the publications.
    • meta - These are computations and interpretations that are about the core, but not required by the core. This is where the design reasoning lives.
  • The core is implemented as a standard targets computational pipeline.

  • The publications are implemented as extensions to the core pipeline.

    • There may be computational steps to generate data objects needed specifically for publications (e.g. plots and tables).
    • It is recommended that computation in Rmd documents is minimised.
    • The publications are plain Rmd documents (not workflowr notebooks).
    • The publication-specific data objects and the rendering of the Rmd documents are managed by targets.
  • The meta components are implemented as short chains hanging off the core pipeline.

    • There may be computational steps to generate data objects needed specifically for meta documents.
    • However, there is less pressure to minimise computation in the Rmd documents.
    • The meta publications are workflowr Rmd notebooks.
    • The meta-specific data objects and the rendering of the Rmd documents are managed by targets.
    • The meta documents are rendered to a workflowr website.
    • The workflowr website also has links to the rendered publications.
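The three components above might be laid out in a `_targets.R` file roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline actually used in this project. Every file path, target name, and helper function is hypothetical. It also assumes workflowr's default layout (notebooks in `analysis/`, rendered HTML in `docs/`) and that the report files exist when the pipeline runs.

```r
# _targets.R (sketch): every file path and function here is hypothetical.
library(targets)

# Stand-ins for the project's real functions.
clean_data <- function(raw) raw[stats::complete.cases(raw), , drop = FALSE]
summarise_core <- function(clean) data.frame(n_rows = nrow(clean))

# Assumes workflowr's default layout: analysis/*.Rmd renders to docs/*.html.
html_for <- function(rmd) {
  file.path("docs", sub("\\.Rmd$", ".html", basename(rmd)))
}

pipeline <- list(
  # core: purely mechanical transformations
  tar_target(raw_file, "data/raw.csv", format = "file"),
  tar_target(raw_data, read.csv(raw_file)),
  tar_target(core_data, clean_data(raw_data)),

  # publications: a plain Rmd rendered as one more pipeline step
  tar_target(pub_table, summarise_core(core_data)),
  tar_target(pub_rmd, "publications/paper.Rmd", format = "file"),
  tar_target(pub_html, {
    pub_table # named here so that targets records the dependency
    rmarkdown::render(pub_rmd, output_dir = "output",
                      knit_root_dir = getwd(), quiet = TRUE)
  }, format = "file"),

  # meta: a workflowr notebook hanging off the core
  tar_target(meta_summary, summarise_core(raw_data)),
  tar_target(meta_rmd, "analysis/m_01_cleaning.Rmd", format = "file"),
  tar_target(meta_html, {
    meta_summary # named here so that targets records the dependency
    workflowr::wflow_build(files = meta_rmd, view = FALSE)
    html_for(meta_rmd)
  }, format = "file")
)

pipeline
```

Inside the Rmd files the data objects would be read with `targets::tar_read()`. For plain Rmd reports, `tarchetypes::tar_render()` is a purpose-built alternative that detects `tar_read()` and `tar_load()` calls in the report, so the dependencies do not have to be named by hand.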

An example is summarised in the following diagram.

The arrows represent data flows (dependencies). These dependencies allow targets to work out what is out of date, and therefore needs re-execution, whenever any of the tracked entities is modified.

The circles represent data objects (R objects and files).

The double circles represent Rmarkdown files. targets treats them like any other data object, but I have distinguished them in this diagram.

The triangles represent functions that generate or transform data.

The hexagons represent rendered Rmarkdown files.

The red nodes represent the core pipeline. Data is ingested and repeatedly transformed by functions.

The green nodes represent the publication workflow. There can be multiple publications derived from the core.

A publication may apply functions to core data to generate summaries for inclusion in the publication (e.g. plots, tables).

The text of the publication is in the plain Rmarkdown file.

The publication Rmarkdown file and the data it depends on are knitted to generate the rendered publication.

The gold nodes represent the meta publication workflow. The two dark gold nodes are a special case of the meta publication workflow.

A meta publication typically applies functions to some core data to generate summaries which inform the design reasoning for the next set of functions in the core pipeline. (Other patterns are possible, taking data from multiple core data objects, or even no data at all.)

The gold double circle nodes represent workflowr Rmarkdown files. These contain the text of the reasoning behind the design decisions.

The workflowr Rmarkdown file and the data it depends on are knitted to generate the rendered meta publication recording the reasoning behind some design decisions.

There can be many meta publications. They are the documentation of the design of the project.

The two dark gold nodes represent the website part of the meta publication workflow.

The workflowr index Rmarkdown normally contains links to all the rendered documents of interest (meta publications and external publications) and is rendered to become the home page of the project website.

3 Open questions

These are not immediately relevant to how I think the current project will pan out, but I could easily imagine them being relevant in other projects.

These questions arise because of my earlier assumption that targets and workflowr are focused on creating a current status of the project that is computationally consistent (up to date). Consequently, I have assumed that the current status of the project is directly accessible, but prior consistent states of the project are not directly (and therefore not easily) accessible from within the project. The questions below relate to use-cases where we would want current and prior states of the project to be simultaneously and directly available.

These questions are based on the assumption that it’s advantageous to have the entire computational process managed by targets or workflowr to ensure that everything is in a consistent state. That is, I am trying to avoid having any computational processes that are not managed by targets or workflowr.

I suspect that tarchetypes and branching may be relevant to these questions.

3.1 Accumulating historical/indexed states

Imagine a computational pipeline that ingests some data and reports on it. The current output reports reflect the current input data.

Now imagine that the input data is regularly updated. Whenever the input data is updated the output reports would also be updated so that the previous output no longer exists in the pipeline environment. However, in that use-case it is generally required that all the generated reports continue to be available.

The results could be accumulated outside the computational pipeline but that would appear to mean that part of the computational process is not visible to and managed by targets or workflowr.

  • So, is there a reasonable way in targets or workflowr to accumulate an arbitrary number of analyses/results from the same pipeline?

  • As a related question, would that support regenerating all the reports if the pipeline functions were updated (e.g. a bug was fixed)?

  • Related to the previous question: Would it be possible to accumulate reports corresponding to different definitions of the pipeline functions (e.g. applying different modelling techniques to the data)?

The last question makes clear that referring to “historical” is somewhat misleading. It would more generally be thought of as (potentially multidimensional) indexing across data sets and pipeline definitions.

This might be easier to do if it was conceptualised as a point-in-time computation with the indexing variable(s) used as grouping variable(s) in one input data set. However, that point-in-time view would require recomputing all the outputs when the input is updated, even though recomputing the previous outputs is unnecessary.
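One possible direction, untested in this project, is dynamic branching over the input files, so that each input data set becomes its own branch of a single target. The sketch below assumes the inputs arrive as separate CSV files in a hypothetical `data/batches` directory and uses a hypothetical `summarise_batch()` function as a stand-in for the report-generating step.

```r
# _targets.R (sketch): directory, target names and function are hypothetical.
library(targets)
library(tarchetypes)

# Stand-in for the real per-data-set analysis or report step.
summarise_batch <- function(path) {
  batch <- read.csv(path)
  data.frame(source = basename(path), n_rows = nrow(batch))
}

pipeline <- list(
  # tar_files() re-lists the directory on every run and tracks each file.
  tar_files(batch_file, list.files("data/batches", full.names = TRUE)),

  # One branch per input file.
  tar_target(
    batch_summary,
    summarise_batch(batch_file),
    pattern = map(batch_file)
  )
)

pipeline
```

Under this arrangement, adding a new file should only run the branch for that file, and branches whose file and code are unchanged are skipped. Changing `summarise_batch()` (e.g. to fix a bug) should invalidate every branch, so all the results are regenerated.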

3.2 Comparing historical/indexed states

The previous questions dealt with the case where the outputs are just accumulated. Now consider the case where outputs for different index values are combined computationally inside the pipeline managed by targets or workflowr. This might be used to look at how output values change over time or over changes in the pipeline definition.

  • Is there a reasonable way in targets or workflowr to accumulate an arbitrary number of analyses/results, indexed by data sets and/or pipeline definitions, from the same pipeline such that indexed results can be computationally combined in later steps of the same pipeline?
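Static branching in tarchetypes looks like a candidate for the case where the index runs over pipeline definitions. The following is a sketch only, not something tried in this project. `tar_map()` creates one target per index value and `tar_combine()` brings the indexed results together in a later step of the same pipeline. The two estimator functions, the index values, and the target names are hypothetical.

```r
# _targets.R (sketch): functions, index values and target names are hypothetical.
library(targets)
library(tarchetypes)

# Two alternative definitions of the same pipeline step.
estimate_mean <- function(x) mean(x)
estimate_median <- function(x) stats::median(x)

# Index over pipeline definitions: a label plus the function to substitute.
index <- list(
  method = c("mean", "median"),
  fit_fun = lapply(c("estimate_mean", "estimate_median"), as.symbol)
)

# One `fit_<method>` target per index value.
mapped <- tar_map(
  values = index,
  names = "method",
  tar_target(fit, data.frame(method = method, estimate = fit_fun(input_values)))
)

# A later step that combines all the indexed results.
combined <- tar_combine(
  all_fits,
  mapped[["fit"]],
  command = rbind(!!!.x)
)

pipeline <- list(
  tar_target(input_values, c(1, 2, 3, 10)),
  mapped,
  combined
)

pipeline
```

The same `tar_combine()` step could in principle feed further targets that compare the indexed results. Whether this scales to an index that grows over time, as in section 3.1, is not established here.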

