Last updated: 2021-03-21
Checks: 7 passed, 0 failed
Knit directory: fa_sim_cal/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20201104) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 981bddc. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: .tresorit/
Ignored: _targets/
Ignored: data/VR_20051125.txt.xz
Ignored: output/ent_cln.fst
Ignored: output/ent_raw.fst
Ignored: renv/library/
Ignored: renv/staging/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/workflow.Rmd) and HTML (docs/workflow.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | d4a106d | Ross Gayler | 2021-03-20 | WIP |
html | ac6b6da | Ross Gayler | 2021-03-04 | Build site. |
Rmd | dbd6fbb | Ross Gayler | 2021-03-04 | Add some useful links to the workflow document |
html | 67e6fdf | Ross Gayler | 2021-03-03 | Build site. |
Rmd | 5b5369f | Ross Gayler | 2021-03-03 | Add workflow management notes |
Rmd | d2d559e | Ross Gayler | 2021-03-02 | end of day |
Rmd | 199d85e | Ross Gayler | 2021-03-02 | end of day |
Rmd | 9ba0dc4 | Ross Gayler | 2021-03-01 | end of day |
Rmd | 55ee0b1 | Ross Gayler | 2021-02-27 | end of day |
This project uses the targets and workflowr packages for managing the workflow of the project (making sure that the dependencies between computational steps are satisfied). When this work was started there were no easily found examples of using targets and workflowr together. This notebook contains notes on the proposed workflow for using targets and workflowr.
These points reflect my (possibly faulty) understanding of targets and workflowr. If I am wrong here I hope that somebody will see this and let me know, rather than me having to find out the hard way.
targets and workflowr both work by tracking some set of entities and the computational dependencies between them. When any of the tracked entities changes, the packages calculate the minimal set of downstream dependencies that need to be recomputed to bring all the entities back into a consistent, up-to-date state.
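As a concrete illustration of this tracking, here is a minimal sketch of a `_targets.R` pipeline. The file path and the two helper functions are hypothetical, not taken from this project:

```r
# _targets.R -- minimal sketch of a targets pipeline.
library(targets)

# Hypothetical helper functions for this sketch
read_entities  <- function(path) read.csv(path)
clean_entities <- function(d) d[complete.cases(d), ]

list(
  # Track the raw input file itself, so edits to the file
  # invalidate all downstream targets
  tar_target(raw_file, "data/entities.csv", format = "file"),
  tar_target(ent_raw, read_entities(raw_file)),
  tar_target(ent_cln, clean_entities(ent_raw))
)
```

Running `targets::tar_make()` after editing `clean_entities()` should rebuild only `ent_cln`, leaving `ent_raw` untouched.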
targets

- Supports a computational-pipeline-centric style of analysis.
- Tracks data objects (including files) and functions.
- The focus is on the data transformation performed by the computational pipeline (rather than human-generated text in reports).
- Knows about high-performance computing and can run computations in parallel.

workflowr

- Supports a notebook-centric style of analysis.
- Only tracks Rmd notebook files and the corresponding rendered output files (https://github.com/ropensci/tarchetypes/issues/23#issuecomment-749118599).
  - workflowr::wflow_build() tracks modification dates of Rmd files and the corresponding rendered output files.
  - workflowr::wflow_publish() tracks the git status of Rmd files and the corresponding rendered output files.
- The computational reproducibility aspect is restricted to ensuring that random number seeds are set appropriately, that each notebook is executed in a clean environment, and that the package versions are recorded.
- Automatic building of a website for the rendered notebooks.
- Publication of the website is integrated with git.
- Automatic publication of the website, served by GitHub Pages.
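The build/publish distinction above can be sketched with the two workflowr calls; the file path and commit message are illustrative:

```r
# Run interactively from the workflowr project root.
library(workflowr)

# Render the notebook and run the reproducibility checks
# (tracks modification dates of the Rmd and its rendered HTML)
wflow_build("analysis/workflow.Rmd")

# Commit the Rmd and the rendered HTML together, tying the
# published results to an exact git version of the code
wflow_publish("analysis/workflow.Rmd", message = "Add workflow notes")
```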
Comparison

- targets provides more general and fine-grained control of the computational pipeline.
- If I use targets to manage computational dependency tracking, what extra capabilities does workflowr provide?
- I will use workflowr to handle the building and publication of a project website.

Design states and design reasoning
The design state of the computational pipeline (loosely defined) reflects the current best beliefs. That is, any previous states were believed to be flawed in some way and the current design state is believed to be better.
While the system is being modified from the prior design state to the current design state it is transiently broken and there is no need for that broken state to be preserved and easily accessible later.
Prior design states are not interesting (because they are believed to be worse than the current design state) so there is no need for prior design states to be preserved and easily accessible later.
The reasoning behind the current design state is important and must be preserved and immediately accessible.
The reasoning behind the current design state may involve reference to prior design states and prior design reasoning. Where these references are needed they are included directly in the design reasoning for the current design state and are preserved and immediately accessible.
The proposed workflow needs to support my preferences for how to organise a project. In particular, a computational research project necessarily involves many design choices for the computational details. It is my strong preference that the reasoning behind these design choices (which may involve additional empirical work to support the reasoning) is documented as part of the project.
The total workflow of the project has multiple components:

- The core is implemented as a standard targets computational pipeline.
- The publications are implemented as extensions to the core pipeline. The publication documents are plain Rmd files (not workflowr notebooks) and are rendered by targets.
- The meta components are implemented as short chains hanging off the core pipeline. The meta documents are workflowr Rmd notebooks, rendered within the targets pipeline and included in the workflowr website. The workflowr website also has links to the rendered publications.

An example is summarised in the following diagram.
The arrows represent data flows (dependencies). These dependencies allow targets to work out what is out of date and therefore requires re-execution when any of the tracked entities is modified.
The circles represent data objects (R objects and files).
The double circles represent Rmarkdown files. targets treats them like any other data object, but I have distinguished them in this diagram.
The triangles represent functions that generate or transform data.
The hexagons represent rendered Rmarkdown files.
The red nodes represent the core pipeline. Data is ingested and repeatedly transformed by functions.
The green nodes represent the publication workflow. There can be multiple publications derived from the core.
A publication may apply functions to core data to generate summaries for inclusion in the publication (e.g. plots, tables).
The text of the publication is in the plain Rmarkdown file.
The publication Rmarkdown file and the data it depends on are knitted to generate the rendered publication.
The gold nodes represent the meta publication workflow. The two dark gold nodes are a special case of the meta publication workflow.
A meta publication typically applies functions to some core data to generate summaries which inform the design reasoning for the next set of functions in the core pipeline. (Other patterns are possible, taking data from multiple core data objects, or even no data at all.)
The gold double circle nodes represent workflowr Rmarkdown files. These contain the text of the reasoning behind the design decisions. The workflowr Rmarkdown file and the data it depends on are knitted to generate the rendered meta publication recording the reasoning behind some design decisions.
There can be many meta publications. They are the documentation of the design of the project.
The two dark gold nodes represent the website part of the meta publication workflow.
The workflowr index Rmarkdown normally contains links to all the rendered documents of interest (meta publications and external publications) and is rendered to become the home page of the project website.
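One way to wire the publication and meta documents into the pipeline, as in the diagram above, is tarchetypes::tar_render(), which re-renders a document whenever a target it reads via targets::tar_read() changes. A sketch, with hypothetical file, target, and function names:

```r
# _targets.R -- sketch of rendering documents inside the targets pipeline.
library(targets)
library(tarchetypes)

list(
  # Core pipeline (read_entities() and clean_entities() are hypothetical)
  tar_target(ent_raw, read_entities("data/entities.csv")),
  tar_target(ent_cln, clean_entities(ent_raw)),

  # Publication: a plain Rmd whose chunks call tar_read(ent_cln),
  # so it is re-rendered whenever ent_cln changes
  tar_render(paper, "publications/paper.Rmd"),

  # Meta document: a workflowr Rmd recording design reasoning
  tar_render(design_notes, "analysis/design_notes.Rmd")
)
```

Note that rendering a workflowr notebook with tar_render() bypasses workflowr's own reproducibility checks, so wflow_publish() would presumably still be run separately to publish the website.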
These are not immediately relevant to how I think the current project will pan out, but I could easily imagine them being relevant in other projects.
These questions arise because of my earlier assumption that targets and workflowr are focused on creating a current status of the project that is computationally consistent (up to date). Consequently, I have assumed that the current status of the project is directly accessible, but prior consistent states of the project are not directly (and therefore not easily) accessible from within the project. The questions below relate to use-cases where we would want current and prior states of the project to be simultaneously and directly available.
These questions are based on the assumption that it’s advantageous to have the entire computational process managed by targets or workflowr to ensure that everything is in a consistent state. That is, I am trying to avoid having any computational processes that are not managed by targets or workflowr.
I suspect that tarchetypes and branching may be relevant to these questions.
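For example, tarchetypes::tar_map() (static branching) stamps out one copy of a sub-pipeline per index value. The snapshot names and functions below are hypothetical:

```r
# _targets.R fragment -- static branching over an index variable.
library(targets)
library(tarchetypes)

mapped <- tar_map(
  values = list(snapshot = c("2021-01", "2021-02", "2021-03")),
  tar_target(dat, load_snapshot(snapshot)),    # load_snapshot() is hypothetical
  tar_target(summ, summarise_snapshot(dat))    # one summary target per snapshot
)

list(mapped)
```

Adding a new value to `snapshot` only adds new targets; the existing indexed results are not recomputed, which seems close to the "accumulate" behaviour asked about below.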
Imagine a computational pipeline that ingests some data and reports on it. The current output reports reflect the current input data.
Now imagine that the input data is regularly updated. Whenever the input data is updated the output reports would also be updated so that the previous output no longer exists in the pipeline environment. However, in that use-case it is generally required that all the generated reports continue to be available.
The results could be accumulated outside the computational pipeline, but that would appear to mean that part of the computational process is not visible to and managed by targets or workflowr.
So, is there a reasonable way in targets or workflowr to accumulate an arbitrary number of analyses/results from the same pipeline?
As a related question, would that support regenerating all the reports if the pipeline functions were updated (e.g. a bug was fixed)?
Related to the previous question: Would it be possible to accumulate reports corresponding to different definitions of the pipeline functions (e.g. applying different modelling techniques to the data)?
The last question makes clear that referring to “historical” is somewhat misleading. It would more generally be thought of as (potentially multidimensional) indexing across data sets and pipeline definitions.
This might be easier to do if it was conceptualised as a point in time computation with the indexing variable(s) used as a grouping variable(s) in one input data set. However, that point in time view would require recomputing all the outputs when the input is updated, even though recomputing the previous outputs is unnecessary.
The previous questions dealt with the case where the outputs are just accumulated. Now consider the case where outputs for different index values are combined computationally inside the pipeline managed by targets or workflowr. This might be used to look at how output values change over time or over changes in the pipeline definition.
Is there a reasonable way in targets or workflowr to accumulate an arbitrary number of analyses/results, indexed by data sets and/or pipeline definitions, from the same pipeline, such that the indexed results can be computationally combined in later steps of the same pipeline?
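If static branching is acceptable, tarchetypes::tar_combine() appears to answer part of this: it gathers the per-index targets back into a single downstream target. A sketch (summarise_snapshot() is hypothetical):

```r
# _targets.R fragment -- combine indexed results inside the pipeline.
library(targets)
library(tarchetypes)

mapped <- tar_map(
  values = list(snapshot = c("2021-01", "2021-02")),
  tar_target(summ, summarise_snapshot(snapshot))  # hypothetical per-index summary
)

# Row-bind the per-snapshot summaries into one target that later
# steps can depend on (e.g. to plot how values change over time)
combined <- tar_combine(all_summ, mapped, command = dplyr::bind_rows(!!!.x))

list(mapped, combined)
```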
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] DiagrammeR_1.0.6.1 here_1.0.1 workflowr_1.6.2 targets_0.2.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 RColorBrewer_1.1-2 git2r_0.28.0 jquerylib_0.1.3
[5] bslib_0.2.4 compiler_4.0.4 pillar_1.5.1 later_1.1.0.1
[9] tools_4.0.4 digest_0.6.27 jsonlite_1.7.2 evaluate_0.14
[13] lifecycle_1.0.0 tibble_3.1.0 pkgconfig_2.0.3 rlang_0.4.10
[17] igraph_1.2.6 rstudioapi_0.13 cli_2.3.1 yaml_2.2.1
[21] xfun_0.22 stringr_1.4.0 withr_2.4.1 knitr_1.31
[25] htmlwidgets_1.5.3 sass_0.3.1 vctrs_0.3.6 fs_1.5.0
[29] rprojroot_2.0.2 tidyselect_1.1.0 glue_1.4.2 data.table_1.14.0
[33] R6_2.5.0 processx_3.4.5 fansi_0.4.2 bookdown_0.21
[37] rmarkdown_2.7 whisker_0.4 callr_3.5.1 purrr_0.3.4
[41] magrittr_2.0.1 codetools_0.2-18 ps_1.6.0 promises_1.2.0.1
[45] ellipsis_0.3.1 htmltools_0.5.1.1 assertthat_0.2.1 renv_0.13.1
[49] httpuv_1.5.5 utf8_1.2.1 stringi_1.5.3 visNetwork_2.0.9
[53] crayon_1.4.1