Last updated: 2021-06-24
Checks: 6 1
Knit directory: globalIRmap/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish
to commit the R Markdown file and build the HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200414)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version cb6d0b9. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: R/.Rhistory
Ignored: analysis/.Rhistory
Ignored: renv/library/
Ignored: renv/staging/
Untracked files:
Untracked: .Rbuildignore
Untracked: .drake/
Untracked: .gitignore
Untracked: archived/
Untracked: figtabres.docx
Untracked: log/
Untracked: man/analyze_benchmark.Rd
Untracked: man/bin_misclass.Rd
Untracked: man/bind_gaugepreds.Rd
Untracked: man/change_oversamp.Rd
Untracked: man/change_tasktarget.Rd
Untracked: man/combine_bm.Rd
Untracked: man/comp_GRDCqstats.Rd
Untracked: man/comp_GSIMdurfreq.Rd
Untracked: man/comp_irstats.Rd
Untracked: man/compare_au.Rd
Untracked: man/compare_fr.Rd
Untracked: man/compare_us.Rd
Untracked: man/compute_IRpop.Rd
Untracked: man/compute_confustats.Rd
Untracked: man/create_baselearners.Rd
Untracked: man/create_tasks.Rd
Untracked: man/dynamic_resample.Rd
Untracked: man/dynamic_resamplebm.Rd
Untracked: man/eval_watergap.Rd
Untracked: man/extend_globalsummary_clz.Rd
Untracked: man/extrapolate_IRES.Rd
Untracked: man/extrapolate_networklength.Rd
Untracked: man/formatscales.Rd
Untracked: man/get_USdata.Rd
Untracked: man/get_basemapswintri.Rd
Untracked: man/ggenvhist.Rd
Untracked: man/gggaugeIPR.Rd
Untracked: man/gggauges.Rd
Untracked: man/ggmisclass_single.Rd
Untracked: man/ggpartialdep.Rd
Untracked: man/ggrivers.Rd
Untracked: man/ggvimp.Rd
Untracked: man/layout_ggenvhist.Rd
Untracked: man/make_gaugepreds.Rd
Untracked: man/map_basinBACC.Rd
Untracked: man/netpredformat.Rd
Untracked: man/plot_GRDCflags.Rd
Untracked: man/plot_GSIM.Rd
Untracked: man/qc_pnw.Rd
Untracked: man/rformat_network.Rd
Untracked: man/rsmp_pre.Rd
Untracked: man/select_features.Rd
Untracked: man/selecttrain_rf.Rd
Untracked: man/set_cvresampling.Rd
Untracked: man/set_tuning.Rd
Untracked: man/tabulate_benchmarks.Rd
Untracked: man/tabulate_globalsummary.Rd
Untracked: man/test_joincount.Rd
Untracked: man/test_thresholdsensitivity.Rd
Untracked: man/write_gaugepreds.Rd
Untracked: man/write_netpreds.Rd
Untracked: schema.ini
Unstaged changes:
Deleted: IntermittentAnalysis_MasterScript_reproduced.R
Modified: NAMESPACE
Modified: R/IRmapping_functions.R
Modified: R/IRmapping_packages.R
Modified: R/IRmapping_plan.R
Modified: R/IRmapping_plan_trimmed.R
Modified: analysis/methods_refdisdat.Rmd
Modified: analysis/methods_riveratlas.Rmd
Deleted: compare_classif_vimps.R
Deleted: data/README.md
Modified: figtabres.Rmd
Modified: man/analyzemerge_gaugeir.Rd
Modified: man/comp_GRDCdurfreq.Rd
Modified: man/convert_clastoregrtask.Rd
Modified: man/def_filestructure.Rd
Modified: man/format_gaugestats.Rd
Modified: man/get_oversamp_ratio.Rd
Modified: man/plotGSIMtimeseries.Rd
Modified: man/qc_onde.Rd
Modified: man/read_GRDCgauged_paths.Rd
Modified: man/read_GSIMgauged_paths.Rd
Modified: man/read_gaugep.Rd
Modified: man/readformatGSIMmon.Rd
Modified: man/readformatGSIMsea.Rd
Modified: man/selectformat_predvars.Rd
Modified: man/weighted_sd.Rd
Deleted: mandrake_workflow.R
Deleted: output/README.md
Modified: renv/activate.R
Deleted: tuning_timeanalysis.R
Deleted: workflowr_commands.R
Staged changes:
Modified: NAMESPACE
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/methods_refdisdat.Rmd
) and HTML (docs/methods_refdisdat.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 6e12f71 | messamat | 2021-06-14 | Build site. |
Rmd | 5e433c3 | messamat | 2021-06-14 | Publish new pages |
html | 21660a4 | messamat | 2021-02-23 | Build site. |
Rmd | 0919010 | messamat | 2021-02-23 | wflow_publish(“analysis/methods_refdisdat.Rmd”) |
html | 59c783a | messamat | 2021-01-08 | Build site. |
Rmd | 10e5aa9 | messamat | 2021-01-08 | wflow_publish(“analysis/methods_refdisdat.Rmd”) |
html | 844dd6b | messamat | 2021-01-08 | Build site. |
Rmd | 5d228a8 | messamat | 2021-01-08 | wflow_publish(“analysis/methods_refdisdat.Rmd”) |
html | 8a77149 | messamat | 2021-01-08 | Build site. |
Rmd | 97a066a | messamat | 2021-01-08 | wflow_publish(“analysis/methods_refdisdat.Rmd”) |
html | 47239d1 | messamat | 2021-01-08 | Build site. |
Rmd | 0b9b22a | messamat | 2021-01-08 | wflow_publish(“analysis/methods_refdisdat.Rmd”) |
html | 24c68da | messamat | 2021-01-08 | Build site. |
Rmd | 3647cde | messamat | 2021-01-08 | wflow_publish(“analysis/methods_refdisdat.Rmd”) |
Rmd | 1da3ce4 | messamat | 2021-01-08 | Continue developing app |
html | f7011f5 | messamat | 2021-01-06 | Build site. |
Rmd | bb0df37 | messamat | 2021-01-06 | Start formatting gauge selection for better display |
html | 9b48bde | messamat | 2021-01-06 | Build site. |
Rmd | f1d9dcf | messamat | 2021-01-06 | Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation |
html | f1d9dcf | messamat | 2021-01-06 | Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation |
In this section, we describe the reference streamflow data used in training our model of flow intermittence, including the source of these data (and how to obtain them), pre-processing steps for linking them to the global river network (RiverATLAS) and ensuring the quality and balance of these records. Finally, at the end of this section, you can find an interactive map of the streamflow gauging stations that were used in our analysis as well as those that were excluded from the analysis.
To access computing scripts associated with this part of our project, see the following repositories:
- globalIRmap_py: Python code to download GSIM and spatially format the datasets of gauging stations (among other tasks).
- globalIRmap: R code for QA/QCing and formatting gauging stations discharge records (among other tasks).
The specific scripts involved in the processing of gauging stations are mentioned more specifically throughout this section.
Two streamflow gauging station datasets were used as the source of training and cross-validation data for models — the World Meteorological Organization Global Runoff Data Centre (GRDC) database (n ≈ 10,000) and a complementary subset of the Global Streamflow Indices and Metadata archive (GSIM, n ≈ 31,000), a compilation of twelve free-to-access national and international streamflow gauging station databases.
Whereas the GRDC offers daily river discharge values for most stations, GSIM only contains time series summary indices computed at the yearly, seasonal and monthly resolution (calculated from daily records whose open-access release is restricted for some of the compiled data sources). Therefore, we used the GRDC database as the core of our training/testing set and complemented it with a subset of streamflow gauging stations from GSIM.
A GSIM station was included only if:
We first linked the Global Runoff Data Center (GRDC) hydrometric dataset to the RiverATLAS global river network. We then complemented the GRDC dataset with a subset of Global Streamflow Indices and Metadata Archive (GSIM) gauging stations (Do et al., 2018; Gudmundsson et al., 2018).
The linkage of GRDC stations to RiverATLAS followed a three-step process:
Of the 6,543 GRDC stations with point coordinates that had daily streamflow data post-1961 (as of July 2014), 2,001 were removed in step (ii) and 25 were removed in step (iii), yielding a set of 4,517 stations (including 225 stations whose position was manually adjusted) that could be reliably matched with a river reach in RiverATLAS for subsequent analysis. Following this spatial pre-processing/filtering, we removed GRDC stations with less than 10 years of daily discharge data (excluding years with more than 20 missing days), yielding a final dataset of 3,748 stations.
We applied a similar process of spatial pre-processing to an initial subset of 4,076 GSIM stations (out of 30,959) which: - were not already part of the GRDC database, and - included auxiliary information on the size of the drainage area associated with the station, and - had at least 10 years of daily discharge data (excluding years with more than 20 missing days), and - had a reported drainage area ≥ 5 km2 or a reported mean annual discharge ≥ 0.01 m3 s-1 (as RiverATLAS only includes rivers with a drainage area ≥ 10 km2 or mean annual discharge ≥ 0.1 m3 s-1 to which we added a margin of error), and - either: o had a drainage area < 100 km2 or, o were located on an IRES or, o were located in a river basin which did not already contain a selected GRDC station (based on HydroBASINS Pfafstetter level 5 sub-basins, average area globally = 2.9 x 104 km2).
All GSIM stations had already been associated with the HydroSHEDS flow accumulation grids by Do et al. (2018) following the same procedure documented in Lehner (2012) and outlined above for GRDC stations (step i). Therefore, we directly associated all GSIM stations with a river reach in the RiverATLAS river network following the same approach as described for GRDC stations above (step ii). Given the diversity of original data sources compiled in the GSIM database, an additional level of caution was applied in linking GSIM stations to the network so that every station was manually inspected following the procedure described above for GRDC stations (step iii). We manually modified the position of 1,736 GSIM stations and removed 791, keeping 3,284 stations for subsequent analysis.
Following this spatial pre-processing, we also removed 160 stations located on the same RiverATLAS river reach as another station (keeping the station with the smallest difference in reported drainage area compared to that computed in RiverATLAS for the reach’s pour point). There was no instance of stations with different flow intermittence classes being associated with the same RiverATLAS reach. We also removed 632 stations whose degree of flow regulation (DOR) by upstream reservoirs (Lehner et al., 2011) exceeded 50%. The resulting dataset at this point contained 6,240 stations.
A custom procedure was developed to ensure the quality of the streamflow time series (rather than the spatial location) associated with the gauging stations. The focus of this quality-checking procedure was to ensure the validity of zero-flow readings and the flow intermittence class assigned to each gauge (i.e., perennial or non-perennial). Zero-flow readings at streamflow gauging stations can stem from multiple circumstances. Usually, these readings reflect true river drying due to various natural or anthropogenic processes. However, river freezing, flow reversal (e.g., due to tidal influence), instrument malfunctioning, and data entry or processing errors are also common events that can result in zero-flow readings in spite of the continued flow of water in the channel (Zimmer et al., 2020). Reported time series data may contain ambiguity between zero-flow and no-data entries, leading to potential underestimation of zero-flow (if masked as no-data). In addition, river diversions and reservoirs associated with dams can modify the flow intermittence of a monitored river reach from perennial to non-perennial (e.g., interrupting water flow as a single event during the initial filling of the reservoir; periodic to permanent dewatering of the downstream channel due to water diversion) and vice versa (e.g., keeping a constant flow of water for hydroelectricity production). Ideally, each streamflow record would be accompanied by detailed information and flags describing the quality of individual daily values. However, this information is typically unavailable or difficult to access. Notably, the GRDC has stopped providing data quality flags in recent updates, and both the GRDC and the European Water Association (EWA) recommend users not to rely on existing quality flags (Gudmundsson et al., 2018).
For GSIM stations, a statistical quality-checking procedure was already performed by Gudmundsson et al. (2018) to flag suspect daily values and remove them prior to computing hydrological indices. For databases that provided reliable quality (QA/QC) flags, all flagging typologies were translated to a common framework (see Table 1 in Gudmundsson et al., 2018) and suspect values were removed.
For station records originating from databases that did not provide quality flags (or that recommended not to use them, i.e., EWA and GRDC), a statistical procedure was applied by Gudmundsson et al. (2018) to flag and remove suspect values with the following characteristics: * Days with negative recorded discharge.
* Daily values with more than 10 consecutive equal discharge values larger than zero. This rule is motivated by the fact that many days with consecutive streamflow values often occur due to instrument failure (e.g., damaged sensors, ice jams) or flow regulations. * Daily streamflow values (Q) if log (Q+0.01) was larger or smaller than the mean value of log (Q+0.01) plus or minus 6 times the standard deviation of log (Q+0.01) computed for that calendar day for the entire length of the series. The mean and standard deviation are computed for a 5-day window centred on the calendar day to ensure that enough data are considered. See Gudmundsson et al. (2018), Gudmundsson & Seneviratne (2016), and Klein Tank et al. (2009) for the rationale behind these criteria.
We used the same criteria to automatically flag individual daily streamflow values in the streamflow records of the GRDC gauging stations. However, rather than directly removing flagged daily streamflow values, as was done for GSIM by Gudmundsson et al. (2018), the values that we flagged as being suspect were further investigated through a visual inspection of the gauges’ streamflow records. Prior to visual inspection, we removed all stations for which only integer streamflow values were available, as any daily discharge value < 0.5 m3 s-1 is reported as zero by the data provider at these stations.
When a station exhibited a flow regime that we suspected was caused by a flow regulating structure (e.g., a dam or reservoir), we inspected satellite and topographic imagery for the presence of a regulating structure upstream of the station and excluded the station if one was present. Indicators of flow regulation included abrupt changes in seasonality or decreases in the magnitude of peak- or low-flows, signs of hydropeaking (i.e., short duration, high flow events at regular intervals), sometimes following a temporary dip in discharge (due to reservoir filling). See our online research compendium (https://messamat.github.io/globalIRmap/) for an interactive visualization of processing information for every gauging station that was removed, including the reason for its removal and associated time-series plots.
Due to a pre-processing artefact in the production of GSIM by Gudmundsson et al. (2018), daily streamflow values for stations located in the U.S. had been rounded to two decimals, leading to very low discharge values (< 0.005 m3 s-1) being rounded to 0. Therefore, we made sure of the validity of zero-flow values for all U.S. stations which, according to GSIM records, had at least one zero-flow day per year on average (i.e., would be considered non-perennial in the subsequent analysis): we downloaded and computed flow intermittence statistics directly from the original daily discharge data provided by the United States Geological Survey (USGS). All stations with ≥ 1 zero-flow day per year according to GSIM but < 1 zero-flow day per year according to USGS data were excluded from further analysis.
An additional level of caution was used for stations on river reaches undergoing flow cessation exclusively in winter and for stations in the vicinity of a marine coastline, as instrument freeze-up and tidal flow reversal are both documented sources of anomalous zero-flow values (Zimmer et al., 2020). “Winter-only” non-perennial gauging stations were defined as those whose stream record contained less than one zero-flow day per year on average during months with long-term mean air temperature over 10°C (averaged across the local catchment immediately draining to the river reach, according to WorldClim 2, Fick & Hijmans, 2017). In other words, “winter-only” non-perennial gauging stations were those which would not have qualified as non-perennial according to our criterion if only non-winter months were taken into account. “Marine” stations were defined as those within 3 km of a coastline. For GSIM stations with visually suspect anomalous records (e.g., abrupt shift down to 0 m3 s-1 that may be driven by station freezing), we attempted to obtain original daily streamflow records from the original agencies whose data was used to produce GSIM if they were freely available online (e.g., from HYDAT in Canada or USGS in the United States).
Following this statistical outlier detection and manual time series inspection, we excluded 625 suspicious gauging stations and conducted the rest of the analysis with 5,615 gauging stations for model training and cross-validation, which represented a wide range of river types found on Earth (except Antarctica, see Extended Data Fig. 8 in Main Text for details on their environmental distribution).
On this map are shown all global gauging stations that were reliably matched to the RiverATLAS global digital river network.
Blue-colored points show perennial stations (< 1 zero-flow day per year on average across the record)
Red-colored points show non-perennial stations (>= 1 zero-flow day per year on average across the record)
Semi-transparent points show stations which were kept and included for the training and testing of predictive models.
Opaque points show stations which were excluded from further analysis due to one of the various reasons explained in the previous sections.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] workflowr_1.6.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 rstudioapi_0.13 whisker_0.4 knitr_1.29
[5] magrittr_1.5 R6_2.4.1 rlang_0.4.10 fansi_0.4.1
[9] highr_0.9 stringr_1.4.0 tools_4.0.2 xfun_0.24
[13] utf8_1.1.4 git2r_0.27.1 htmltools_0.5.0 ellipsis_0.3.2
[17] rprojroot_1.3-2 yaml_2.2.1 digest_0.6.25 tibble_3.1.1
[21] lifecycle_0.2.0 crayon_1.3.4 later_1.2.0 vctrs_0.3.8
[25] promises_1.2.0.1 fs_1.5.0 glue_1.4.0 evaluate_0.14
[29] rmarkdown_2.7 stringi_1.4.6 compiler_4.0.2 pillar_1.6.1
[33] backports_1.1.10 httpuv_1.5.4 renv_0.9.3 pkgconfig_2.0.3