Last updated: 2021-07-01
Knit directory: mapme.protectedareas/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20210305) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 364ae4b. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data-raw/addons/docs/rest/
Ignored: data-raw/addons/etc/
Ignored: data-raw/addons/scripts/
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/contribute.Rmd) and HTML (public/contribute.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | dd5c325 | Johannes Schielein | 2021-06-30 | Build site. |
html | 8bd1321 | Johannes Schielein | 2021-06-30 | Host with GitLab. |
html | 3a39ee3 | Johannes Schielein | 2021-06-30 | Host with GitHub. |
html | ae67dca | Johannes Schielein | 2021-06-30 | Host with GitLab. |
Rmd | e9e6d98 | Johannes Schielein | 2021-03-12 | updated the website to be rendered. |
This file explains how to use this repository and how to make good contributions. We also use a code of conduct for open source projects adapted from the Contributor Covenant homepage. You can find this code of conduct in the file CoC.md.
There are two different ways to use this repository. The quickest way is to follow the tutorial scripts and manually install whatever libraries the scripts require. This should work most of the time. However, some of the libraries used in this repository might change in the future, so some functions from the tutorials might no longer be reproducible because the code is outdated. If that happens, you can use the “old” library versions that come as part of this repository. Follow the steps in the next paragraph if you want to copy the whole repository to your local machine and use exactly the same libraries that were used to create the scripts.
This repository uses the renv package to allow for version control of the libraries it depends on. To use this repository on your local machine you can do the following:
- Install the renv package on your local machine.
- Run renv::restore(). This should reinstall all necessary libraries for this repository on your local machine.
Add here information on where and how to report bugs and add suggestions (later)
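The renv steps described above can be sketched in R as follows. The helper name restore_project_library is hypothetical and not part of this repository; the sketch assumes that the working directory is the root of the cloned repository and that it contains the renv.lock file.

```r
# Hypothetical helper (not part of this repository): restore the package
# library recorded in the project's renv.lock file.
restore_project_library <- function(project = ".") {
  # Install renv itself if it is not available yet.
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv")
  }
  # Reinstall the package versions recorded in the lockfile
  # without asking for interactive confirmation.
  renv::restore(project = project, prompt = FALSE)
}

# Usage, from the root of the cloned repository:
# restore_project_library()
```

The call is left commented out because restoring a library downloads and installs packages, which can take a while.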
To ensure reproducibility, proper documentation and open-source code are key. In addition to making all routines public, we also place a strong focus on proper documentation of the methods used. Here are the steps contributors should follow to make a contribution to this repository.
- We use the packages sf and terra. Please try to write your functions using these packages instead of the older packages sp and raster.
- We use R scripts to process our data and R Markdown scripts (Rmd) to document the utilized data, methods, pre-processing routines and analysis scripts. We have some basic recommendations and minimum standards for code development, which are shown in the section Routines for preprocessing Datasets.
This repository is structured as a reproducible project using the workflowr package. For more information, please visit the workflowr website. In addition, this repository uses the renv package to enable others to work with exactly the same package versions that were used in this project. For more information, please visit the renv website.
mapme.protectedareas/
.
├── .gitignore          # specifies which files Git should ignore
├── .Rprofile           # contains information on which libraries and settings should be used at start
├── _workflowr.yml      # yml file needed to control workflowr
├── analysis/
│   ├── about.Rmd
│   ├── index.Rmd
│   ├── license.Rmd
│   ├── *.Rmd           # Rmarkdown files to document pre-processing and analysis routines
│   └── _site.yml
├── code/
│   └── *.R             # R-functions and scripts utilized in Rmd
├── data/               # raw imputable sample data for the routine.
├── docs/               # HTML builds of the Rmds in analysis/
├── output/             # intermediate and final data output from sample data
├── README.md           # general information on the project
└── renv/               # renv directory to lock used package versions
This repository uses a datalake to store input, output and processing data. This datalake is not public because of its size and partly sensitive nature. For every dataset we also document where the data originated, so you might want to recreate the datalake structure on your local machine if you use this repository. Our datalake structure is organized as follows:
datalake/
.
├── mapme.protectedareas/           # specifies the project folder. is named in the same way as the repository
│   ├── input/                      # contains unprocessed input data with original filenames
│   │   ├── teow/
│   │   ├── global_mangrove_watch/
│   │   ├── net_carbon_flux/
│   │   └── ...
│   ├── output/                     # here goes all relevant, processed output data
│   │   ├── polygon/                # the polygon-data on country level of supported and non supported PAs
│   │   ├── raster/                 # raster representation of variables for the impact evaluation
│   │   └── tabular/                # tabular data containing wdpa_id (lines) and processed variables (columns)
│   │       ├── 1_full_database/    # merged final table of all individual variables
│   │       ├── teow/
│   │       ├── global_mangrove_watch/
│   │       ├── net_carbon_flux/
│   │       └── ...
│   └── processing/                 # contains data that is generated during processing.
Naming conventions for the datalake: Please name all folders and processed datasets in lowercase, with an underscore _ as separator instead of whitespace. Please keep the original data names in the input folder. Please also make sure to clean the processing folder regularly. Nevertheless, if there are datasets that take a very long time to process, you might want to keep them permanently in the processing folder.
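The naming convention for folders and processed datasets can be applied programmatically. The following is a minimal sketch in base R; the function name to_datalake_name is hypothetical and not part of this repository. It only lowercases the name and replaces whitespace with underscores, so dots (as in mapme.protectedareas) are left untouched.

```r
# Hypothetical helper (not part of this repository): turn a free-text
# name into a datalake-style name (lowercase, underscores for whitespace).
to_datalake_name <- function(x) {
  x <- tolower(trimws(x))          # lowercase, drop surrounding whitespace
  gsub("[[:space:]]+", "_", x)     # any run of whitespace becomes one underscore
}

to_datalake_name(c("Global Mangrove Watch", "Net Carbon  Flux"))
```

Original file names in the input folder should not be passed through such a helper, since they are meant to stay unchanged.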
This is a step-by-step guide to creating a new pre-processing routine, based on the example of the TEOW ecoregions dataset from WWF. Pre-processing routines in this repository are organized by thematic datasets from different data sources. The intention is to have a modular structure that makes it easy to add or delete data sources from the pre-processing routine and eventually to chain the routines together to pre-process several variables consecutively for a number of given areas.
This structure also allows us to debug the code more easily when routines are chained. An exception to this structure are routines that give access to already pre-processed datasets, in our case the API access to the DOPA/JRC REST Services, which provide tabular information for several thematic datasets pre-processed by JRC on the basis of PAs.
Please try to develop your routine using a small subset of sample data that requires little processing time for the documentation and has low storage requirements. For PAs we recommend using, for example, 3 or 4 PAs from your area of interest. We currently use the wdpar package to automatically download and preprocess sample PAs from the WDPA. You can also save the sample data in the data folder. This is recommended if the dataset you use has no definitive storage place on the internet. Nevertheless, we encourage you to include the download step in the routine as well and to save the sample data in the temporary folder that is created by the R session.
To create a new pre-processing routine you should
1. Create a new R script and develop your routine in the code folder. Details and naming conventions are given below in the section R scripts. In this script you should develop the data-processing routine, which will run (at least) on the whole PA portfolio from KfW. Depending on the routine, this can include several steps, e.g. downloading, cleaning, merging or stacking different input layers, geo-processing and final data wrangling to achieve an adequate output structure (details below). Try to document all steps that are applied to the data with inline comments. Where possible, implement all steps using the R packages terra and sf plus auxiliary packages such as tidyr and dplyr for data wrangling. Please make sure that your final output data is structured and stored as a long table. For an example see the section Output data structure.
2. Create a new Rmd file and place it in the analysis folder (see structure above). Best practice is to use the wflow_open() command, which will create a new Rmd file. Name this new file according to the pre-processed data source and variable, e.g. wwf_teow.Rmd for TEOW Ecoregions from the World Wide Fund for Nature (WWF) or gfw_forests.Rmd for different variables from the Global Forest Watch (GFW) such as forest cover, forest cover loss or emissions from forest cover loss. The full command to create a new script in our example would therefore look like this: wflow_open("analysis/wwf_teow.Rmd").
3. Document your routine in the Rmd file using a minimal reproducible example. Please try to follow our minimum standards given in the section Contents of the Rmd file.
4. Link your report in index.Rmd. You can see how to do this in the section Render and link your report.
You should try to develop R functions that are kept in a separate R script and then sourced in the Rmd files. Those functions and R scripts should be placed in the code folder (see structure above). This is especially relevant for code which can be re-used in several pre-processing routines, such as chained pre-processing steps, e.g. reproject -> rasterize -> stack -> zonal.
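The pattern of keeping functions in a separate R script and sourcing them from an Rmd file can be sketched as follows. The script name, the function calc_area_share and its contents are hypothetical stand-ins; the script is written to the temporary folder of the R session only to keep the example self-contained. In the repository, the file would live in the code folder and the Rmd chunk would source it from there.

```r
# Stand-in for a file in the code folder (file name and function are hypothetical).
script_path <- file.path(tempdir(), "calc_area_share.R")
writeLines(c(
  "# share of each value in the total, in percent",
  "calc_area_share <- function(area) {",
  "  100 * area / sum(area)",
  "}"
), script_path)

# In the Rmd file, a chunk would load the function and then apply it.
source(script_path)
shares <- calc_area_share(c(10, 30, 60))
shares
```

Keeping the function in its own script means that several Rmd files can source the same code instead of each carrying its own copy.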
There are two naming conventions to choose from when saving the script:
1. Name the R file according to the name of your Rmd file. This could be, for example, preprocessing_wwf_teow.R. Use this if your code is most probably specific to the dataset that you process.
2. If your code is more generic, give the file a name that describes the routine and document it in the index.Rmd file as described in the subsequent section.
All of the files in the analysis folder will be rendered to create the project website. To render new Rmd files, use the function wflow_build() from the workflowr package. After rendering, html files will be created in the docs folder. PLEASE MAKE SURE to create new links in index.Rmd that reference these new html files. This ensures that the new routines appear in the rendered website afterwards.
PLEASE ALSO MAKE SURE to document newly created R files in index.Rmd to allow others to get a quick overview of the existing pre-processing routines. You only need to do this if you think your routine could also be used outside of the context in which you processed your dataset (i.e. more generic methods as shown above in the example of chained processing steps). If possible, you can also add a graphical representation of the workflow (see examples in index), which will enable others to understand your routine quickly. Images for such graphics should be stored in docs as well, with a file name containing the name of the routine. For the example above this would be rasterize_and_zonalstats.png. You can also find a PowerPoint template for workflows in the docs folder.
To create good documentation of the processed data and the authors of the script, we would like to ask for some minimal information in the Rmd files consisting of
To be able to create a comprehensive and easy-to-analyse database from all pre-processing routines and datasets in the end, we suggest using a final data structure that is similar for each pre-processing routine. This data structure should be a long table with three columns: wdpa_id, variable_name and value, which will allow us to easily merge and analyse all pre-processed datasets.
In the following section we provide an example for typical output data from a preprocessing routine and how this data could be transferred to a long table:
library(dplyr)
library(tidyr)
# create two examples of what output data could look like (classical tables in the "wide" format)
df <- tibble(wdpa_id = 1:10,
             carbonflux_co2_balance_2010 = 21:30,
             carbonflux_co2_balance_2020 = 21:30)
df2 <- tibble(wdpa_id = 1:10,
              teow_rainforest = 21:30,
              teow_savannah = 21:30)
# have a look at them
df
df2
# pivot the tables to the long format with the columns wdpa_id, variable_name and value
df_long <-
  pivot_longer(df, cols = c(carbonflux_co2_balance_2010, carbonflux_co2_balance_2020),
               names_to = "variable_name")
df_long2 <-
  pivot_longer(df2, cols = c(teow_rainforest, teow_savannah),
               names_to = "variable_name")
# now we can bind both datasets
wholedatabase <- bind_rows(df_long, df_long2)
# and very easily use them for portfolio analysis.
# example to calculate how much rainforest and savannah is in our protected areas
# (%in% is used instead of ==, because == would recycle the vector of names row by row)
wholedatabase %>%
  filter(variable_name %in% c("teow_rainforest", "teow_savannah")) %>%
  group_by(variable_name) %>%
  summarise(sum = sum(value))
# the same for the carbon flux variables
wholedatabase %>%
  filter(variable_name %in% c("carbonflux_co2_balance_2010", "carbonflux_co2_balance_2020")) %>%
  group_by(variable_name) %>%
  summarise(sum = sum(value))
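Before a table is merged into the full database, it can be useful to check that it follows the three-column structure named in this section (wdpa_id, variable_name and value). The following base R sketch shows one way to do that; the function name check_long_table and the sample values are hypothetical and not part of this repository.

```r
# Hypothetical helper (not part of this repository): stop with an error
# if a table lacks one of the columns of the suggested long format.
check_long_table <- function(x) {
  required <- c("wdpa_id", "variable_name", "value")
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# A small made-up table in the suggested structure passes the check.
ok_table <- data.frame(
  wdpa_id = 1:2,
  variable_name = "teow_savannah",
  value = c(21, 22)
)
check_long_table(ok_table)
```

A table that still uses other column names stops with an error that lists the missing columns.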
To adopt an unambiguous data storage strategy, we would like to use a standard naming convention for all datasets, whether they are stored in raw or in processed form. The naming convention should follow this structure:
repository_data-name_raw/processed.extension
Example 1: mapme.protectedareas_net-forest-carbon-flux-10N-060W_raw.tif
Example 2: mapme.protectedareas_Terrestrial-Ecoregions-World_processed.shp
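File names that follow this convention can be assembled with a small helper. This is a minimal sketch in base R; the function name build_dataset_name and its arguments are hypothetical and not part of this repository. It assumes that words inside the data name are joined with hyphens, as in the two examples.

```r
# Hypothetical helper (not part of this repository): build a file name of
# the form repository_data-name_raw/processed.extension.
build_dataset_name <- function(repository, data_name,
                               state = c("raw", "processed"), extension) {
  state <- match.arg(state)   # only "raw" or "processed" are accepted
  # Join the words of the data name with hyphens, as in the examples.
  data_name <- gsub("[[:space:]_]+", "-", trimws(data_name))
  paste0(repository, "_", data_name, "_", state, ".", extension)
}

build_dataset_name("mapme.protectedareas", "Terrestrial Ecoregions World",
                   state = "processed", extension = "shp")
```

Restricting state through match.arg() means a misspelled state stops with an error instead of producing a file name that breaks the convention.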
This section will be written as soon as we have enough working pre-processing scripts to chain them. The idea is to source several pre-processing scripts wrapped in tryCatch so that an error in one script does not interrupt the whole chain.
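Until that section exists, the idea can be illustrated with a minimal sketch. This is an assumption about how such a chain could look, not the repository's implementation: the function run_chain and the two stand-in scripts are hypothetical and are written to the temporary folder only to keep the example self-contained.

```r
# Two stand-in scripts; in the repository these would be pre-processing
# scripts in the code folder (the file names here are hypothetical).
good_script <- file.path(tempdir(), "preprocessing_ok.R")
bad_script  <- file.path(tempdir(), "preprocessing_fails.R")
writeLines("result_ok <- 1 + 1", good_script)
writeLines("stop('input data not found')", bad_script)

# Source each script in its own environment; an error is reported and
# recorded, but does not stop the remaining scripts from running.
run_chain <- function(scripts) {
  vapply(scripts, function(script) {
    tryCatch({
      source(script, local = new.env())
      "ok"
    }, error = function(e) {
      message("Skipping ", basename(script), ": ", conditionMessage(e))
      "failed"
    })
  }, character(1))
}

status <- run_chain(c(bad_script, good_script))
status
```

The returned status vector is named by script path, which makes it easy to see afterwards which routines need attention.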
This will be added later.
This repository offers common labels such as bug or enhancement. In addition, it has labels that form a prioritization scheme showing which issues should be addressed first. To that end we use a simplified method called MoSCoW, which categorizes tasks into Must, Should, Could and Won't. Issues of category Must are the most relevant and should be addressed first. After addressing all of these issues we will move on to the Should issues and, if time is left, to the Could issues. In addition, there is a label called Fast Lane, which marks issues that should be addressed first within their given category. So an issue labelled Should and Fast Lane should be addressed quickly after all of the Must issues are processed.
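The resulting order can be illustrated with a small base R sketch. The issue titles and the data frame layout are made up for illustration and do not correspond to real issues in this repository.

```r
# Made-up issues with a MoSCoW category and an optional Fast Lane flag.
issues <- data.frame(
  title = c("Fix broken link", "Add new dataset",
            "Speed up zonal stats", "Refactor plots"),
  category = c("Should", "Must", "Should", "Could"),
  fast_lane = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Rank the categories in MoSCoW order, then put Fast Lane issues
# first within each category (FALSE sorts before TRUE, hence the negation).
category_rank <- match(issues$category, c("Must", "Should", "Could", "Won't"))
ordered_issues <- issues[order(category_rank, !issues$fast_lane), ]
ordered_issues$title
```

Here the Must issue comes first, followed by the Should issue marked Fast Lane, the other Should issue and finally the Could issue.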
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 whisker_0.4 knitr_1.30 magrittr_2.0.1
[5] workflowr_1.6.2 R6_2.5.0 rlang_0.4.11 fansi_0.5.0
[9] stringr_1.4.0 tools_3.6.3 xfun_0.20 utf8_1.2.1
[13] git2r_0.28.0 htmltools_0.5.1.1 ellipsis_0.3.2 rprojroot_2.0.2
[17] yaml_2.2.1 digest_0.6.27 tibble_3.1.1 lifecycle_1.0.0
[21] crayon_1.4.1 later_1.2.0 vctrs_0.3.8 fs_1.5.0
[25] promises_1.2.0.1 glue_1.4.2 evaluate_0.14 rmarkdown_2.6
[29] stringi_1.6.2 compiler_3.6.3 pillar_1.6.0 httpuv_1.6.1
[33] pkgconfig_2.0.3