Last updated: 2021-06-24

Checks: 7 0

Knit directory: globalIRmap/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200414) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0e72638. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:

Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    R/.Rhistory
    Ignored:    analysis/.Rhistory
    Ignored:    renv/library/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  .drake/
    Untracked:  .gitignore
    Untracked:  figtabres.docx
    Untracked:  schema.ini

Unstaged changes:
    Modified:   log/drake.log
    Deleted:    output/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/methods_refdisdat.Rmd) and HTML (docs/methods_refdisdat.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 0e72638 messamat 2021-06-24 wflow_publish(c(“analysis/about.Rmd”, “analysis/index.Rmd”, “analysis/license.Rmd”,
Rmd 8ad58ab messamat 2021-06-24 Add documentation, format project for package building
html 8ad58ab messamat 2021-06-24 Add documentation, format project for package building
html 6e12f71 messamat 2021-06-14 Build site.
Rmd 5e433c3 messamat 2021-06-14 Publish new pages
html 21660a4 messamat 2021-02-23 Build site.
Rmd 0919010 messamat 2021-02-23 wflow_publish(“analysis/methods_refdisdat.Rmd”)
html 59c783a messamat 2021-01-08 Build site.
Rmd 10e5aa9 messamat 2021-01-08 wflow_publish(“analysis/methods_refdisdat.Rmd”)
html 844dd6b messamat 2021-01-08 Build site.
Rmd 5d228a8 messamat 2021-01-08 wflow_publish(“analysis/methods_refdisdat.Rmd”)
html 8a77149 messamat 2021-01-08 Build site.
Rmd 97a066a messamat 2021-01-08 wflow_publish(“analysis/methods_refdisdat.Rmd”)
html 47239d1 messamat 2021-01-08 Build site.
Rmd 0b9b22a messamat 2021-01-08 wflow_publish(“analysis/methods_refdisdat.Rmd”)
html 24c68da messamat 2021-01-08 Build site.
Rmd 3647cde messamat 2021-01-08 wflow_publish(“analysis/methods_refdisdat.Rmd”)
Rmd 1da3ce4 messamat 2021-01-08 Continue developing app
html f7011f5 messamat 2021-01-06 Build site.
Rmd bb0df37 messamat 2021-01-06 Start formatting gauge selection for better display
html 9b48bde messamat 2021-01-06 Build site.
Rmd f1d9dcf messamat 2021-01-06 Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation
html f1d9dcf messamat 2021-01-06 Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation

In this section, we describe the reference streamflow data used in training our model of flow intermittence, including the source of these data (and how to obtain them), pre-processing steps for linking them to the global river network (RiverATLAS) and ensuring the quality and balance of these records. Finally, at the end of this section, you can find an interactive map of the streamflow gauging stations that were used in our analysis as well as those that were excluded from the analysis.

To access computing scripts associated with this part of our project, see the following repositories:
- globalIRmap_py: Python code to download GSIM and spatially format the datasets of gauging stations (among other tasks).
- globalIRmap: R code for QA/QCing and formatting gauging stations discharge records (among other tasks).
The specific scripts involved in the processing of gauging stations are mentioned more specifically throughout this section. See the Getting started and Workflow pages of this website for more information on project management.

Source datasets


Two streamflow gauging station datasets were used as the source of training and cross-validation data for models — the World Meteorological Organization Global Runoff Data Centre (GRDC) database (n ≈ 10,000) and a complementary subset of the Global Streamflow Indices and Metadata archive (GSIM, n ≈ 31,000), a compilation of twelve free-to-access national and international streamflow gauging station databases.
Whereas the GRDC offers daily river discharge values for most stations, GSIM only contains time series summary indices computed at the yearly, seasonal and monthly resolution (calculated from daily records whose open-access release is restricted for some of the compiled data sources). Therefore, we used the GRDC database as the core of our training/testing set and complemented it with a subset of streamflow gauging stations from GSIM.

A GSIM station was included only if:

  1. it was not already part of the GRDC database, and
  2. it included auxiliary information on the drainage area of the monitored reach (for reliably associating it to RiverATLAS), and
  3. had a reported drainage area ≥ 5 km2 or a reported mean annual discharge ≥ 0.01 m3 s-1 (as RiverATLAS only includes rivers with a drainage area ≥ 10 km2 or mean annual discharge ≥ 0.1 m3 s-1 to which we added a margin of error), and
  4. either:
    • had a drainage area < 100 km2 or,
    • was located on an IRES or,
    • was located in a river basin which did not already contain a selected GRDC station (based on HydroBASINS Pfafstetter level 5 sub-basins, average area globally = 2.9 x 10^4 km2).

Obtaining source data

Subsets of the GRDC dataset can now be downloaded through their interactive platform. To download data for all stations globally, however, one make a written request to grdc(at)

GSIM can be manually downloaded from the pangaea repositories or by running

Processing and selecting gauging stations data

Assignment of river network location

We first linked the Global Runoff Data Center (GRDC) hydrometric dataset to the RiverATLAS global river network. We then complemented the GRDC dataset with a subset of Global Streamflow Indices and Metadata Archive (GSIM) gauging stations (Do et al., 2018; Gudmundsson et al., 2018).

The linkage of GRDC stations to RiverATLAS followed a three-step process:

  1. The stations were associated with the HydroSHEDS flow accumulation grids (Lehner et al., 2008) based on the procedure documented in GRDC Report number 41 (Lehner, 2012): each station was automatically linked to the location within five kilometres around the original location reported in the GRDC database that optimized the agreement between the reported drainage area in the GRDC database and the modelled drainage area derived from HydroSHEDS, while limiting the distance from its original location.
  2. Following this placement, only stations whose HydroSHEDS drainage area diverged by less than 5% from the reported GRDC areas were retained for subsequent steps. Each station was then associated with a river reach in the RiverATLAS river network (vector-based):
  • Each station was first associated with the nearest river reach (from the location determined in the previous step).
  • If the drainage area at the pour point of the reach differed by more than 10% from the reported drainage area in the GRDC database, we manually inspected, and corrected if appropriate, the location of the station (see step iii).
  1. Manual inspection involved verifying that the reach a station had been associated with in the RiverATLAS digital river network corresponded to the actual waterway the station was located on, based on topographic and high-resolution satellite imagery (ESRI ArcGIS basemaps). If we could not ascertain the position of the gauging station, the GRDC-reported river and station names were verified in close vicinity (~10 km), exploiting the fact that station names often originate from nearby settlements, roads, or other geographic features. If a station could not be verified within this vicinity, the search was extended to within 50-100 km. If still no location was found that matched the river and/or station name, the station name was queried in search engines and online maps to see whether a location with this name existed. In all cases, the final decision on whether a station was moved to a new and “reliable” location depended on whether at least two out of the following four indicators could be matched reasonably well: a) location on corresponding waterway based on satellite imagery or topographic map; b) river name; c) station name and d) drainage area (match between reported GRDC value and modeled RiverATLAS value). Some final decisions were subjective as difficult combinations could arise (e.g., multiple agreements yet also disagreement(s) in the different indices).

Of the 6,543 GRDC stations with point coordinates that had daily streamflow data post-1961 (as of July 2014), 2,001 were removed in step (ii) and 25 were removed in step (iii), yielding a set of 4,517 stations (including 225 stations whose position was manually adjusted) that could be reliably matched with a river reach in RiverATLAS for subsequent analysis. Following this spatial pre-processing/filtering, we removed GRDC stations with less than 10 years of daily discharge data (excluding years with more than 20 missing days), yielding a final dataset of 3,748 stations.

We applied a similar process of spatial pre-processing to an initial subset of 4,076 GSIM stations (out of 30,959) which: - were not already part of the GRDC database, and - included auxiliary information on the size of the drainage area associated with the station, and - had at least 10 years of daily discharge data (excluding years with more than 20 missing days), and - had a reported drainage area ≥ 5 km2 or a reported mean annual discharge ≥ 0.01 m3 s-1 (as RiverATLAS only includes rivers with a drainage area ≥ 10 km2 or mean annual discharge ≥ 0.1 m3 s-1 to which we added a margin of error), and - either: o had a drainage area < 100 km2 or, o were located on an IRES or, o were located in a river basin which did not already contain a selected GRDC station (based on HydroBASINS Pfafstetter level 5 sub-basins, average area globally = 2.9 x 104 km2).

All GSIM stations had already been associated with the HydroSHEDS flow accumulation grids by Do et al. (2018) following the same procedure documented in Lehner (2012) and outlined above for GRDC stations (step i). Therefore, we directly associated all GSIM stations with a river reach in the RiverATLAS river network following the same approach as described for GRDC stations above (step ii). Given the diversity of original data sources compiled in the GSIM database, an additional level of caution was applied in linking GSIM stations to the network so that every station was manually inspected following the procedure described above for GRDC stations (step iii). We manually modified the position of 1,736 GSIM stations and removed 791, keeping 3,284 stations for subsequent analysis.

Following this spatial pre-processing, we also removed 160 stations located on the same RiverATLAS river reach as another station (keeping the station with the smallest difference in reported drainage area compared to that computed in RiverATLAS for the reach’s pour point). There was no instance of stations with different flow intermittence classes being associated with the same RiverATLAS reach. We also removed 632 stations whose degree of flow regulation (DOR) by upstream reservoirs (Lehner et al., 2011) exceeded 50%. The resulting dataset at this point contained 6,240 stations.

Script for assignment of river network location:

Quality-checking of discharge information

A custom procedure was developed to ensure the quality of the streamflow time series (rather than the spatial location) associated with the gauging stations. The focus of this quality-checking procedure was to ensure the validity of zero-flow readings and the flow intermittence class assigned to each gauge (i.e., perennial or non-perennial). Zero-flow readings at streamflow gauging stations can stem from multiple circumstances. Usually, these readings reflect true river drying due to various natural or anthropogenic processes. However, river freezing, flow reversal (e.g., due to tidal influence), instrument malfunctioning, and data entry or processing errors are also common events that can result in zero-flow readings in spite of the continued flow of water in the channel (Zimmer et al., 2020). Reported time series data may contain ambiguity between zero-flow and no-data entries, leading to potential underestimation of zero-flow (if masked as no-data). In addition, river diversions and reservoirs associated with dams can modify the flow intermittence of a monitored river reach from perennial to non-perennial (e.g., interrupting water flow as a single event during the initial filling of the reservoir; periodic to permanent dewatering of the downstream channel due to water diversion) and vice versa (e.g., keeping a constant flow of water for hydroelectricity production). Ideally, each streamflow record would be accompanied by detailed information and flags describing the quality of individual daily values. However, this information is typically unavailable or difficult to access. Notably, the GRDC has stopped providing data quality flags in recent updates, and both the GRDC and the European Water Association (EWA) recommend users not to rely on existing quality flags (Gudmundsson et al., 2018).

For GSIM stations, a statistical quality-checking procedure was already performed by Gudmundsson et al. (2018) to flag suspect daily values and remove them prior to computing hydrological indices. For databases that provided reliable quality (QA/QC) flags, all flagging typologies were translated to a common framework (see Table 1 in Gudmundsson et al., 2018) and suspect values were removed.

For station records originating from databases that did not provide quality flags (or that recommended not to use them, i.e., EWA and GRDC), a statistical procedure was applied by Gudmundsson et al. (2018) to flag and remove suspect values with the following characteristics: * Days with negative recorded discharge.
* Daily values with more than 10 consecutive equal discharge values larger than zero. This rule is motivated by the fact that many days with consecutive streamflow values often occur due to instrument failure (e.g., damaged sensors, ice jams) or flow regulations. * Daily streamflow values (Q) if log (Q+0.01) was larger or smaller than the mean value of log (Q+0.01) plus or minus 6 times the standard deviation of log (Q+0.01) computed for that calendar day for the entire length of the series. The mean and standard deviation are computed for a 5-day window centred on the calendar day to ensure that enough data are considered. See Gudmundsson et al. (2018), Gudmundsson & Seneviratne (2016), and Klein Tank et al. (2009) for the rationale behind these criteria.

We used the same criteria to automatically flag individual daily streamflow values in the streamflow records of the GRDC gauging stations. However, rather than directly removing flagged daily streamflow values, as was done for GSIM by Gudmundsson et al. (2018), the values that we flagged as being suspect were further investigated through a visual inspection of the gauges’ streamflow records. Prior to visual inspection, we removed all stations for which only integer streamflow values were available, as any daily discharge value < 0.5 m3 s-1 is reported as zero by the data provider at these stations.

When a station exhibited a flow regime that we suspected was caused by a flow regulating structure (e.g., a dam or reservoir), we inspected satellite and topographic imagery for the presence of a regulating structure upstream of the station and excluded the station if one was present. Indicators of flow regulation included abrupt changes in seasonality or decreases in the magnitude of peak- or low-flows, signs of hydropeaking (i.e., short duration, high flow events at regular intervals), sometimes following a temporary dip in discharge (due to reservoir filling). See our online research compendium ( for an interactive visualization of processing information for every gauging station that was removed, including the reason for its removal and associated time-series plots.

Due to a pre-processing artefact in the production of GSIM by Gudmundsson et al. (2018), daily streamflow values for stations located in the U.S. had been rounded to two decimals, leading to very low discharge values (< 0.005 m3 s-1) being rounded to 0. Therefore, we made sure of the validity of zero-flow values for all U.S. stations which, according to GSIM records, had at least one zero-flow day per year on average (i.e., would be considered non-perennial in the subsequent analysis): we downloaded and computed flow intermittence statistics directly from the original daily discharge data provided by the United States Geological Survey (USGS). All stations with ≥ 1 zero-flow day per year according to GSIM but < 1 zero-flow day per year according to USGS data were excluded from further analysis.

An additional level of caution was used for stations on river reaches undergoing flow cessation exclusively in winter and for stations in the vicinity of a marine coastline, as instrument freeze-up and tidal flow reversal are both documented sources of anomalous zero-flow values (Zimmer et al., 2020). “Winter-only” non-perennial gauging stations were defined as those whose stream record contained less than one zero-flow day per year on average during months with long-term mean air temperature over 10°C (averaged across the local catchment immediately draining to the river reach, according to WorldClim 2, Fick & Hijmans, 2017). In other words, “winter-only” non-perennial gauging stations were those which would not have qualified as non-perennial according to our criterion if only non-winter months were taken into account. “Marine” stations were defined as those within 3 km of a coastline. For GSIM stations with visually suspect anomalous records (e.g., abrupt shift down to 0 m3 s-1 that may be driven by station freezing), we attempted to obtain original daily streamflow records from the original agencies whose data was used to produce GSIM if they were freely available online (e.g., from HYDAT in Canada or USGS in the United States).

Following this statistical outlier detection and manual time series inspection, we excluded 625 suspicious gauging stations and conducted the rest of the analysis with 5,615 gauging stations for model training and cross-validation, which represented a wide range of river types found on Earth (except Antarctica, see Extended Data Fig. 8 in Main Text for details on their environmental distribution).

Script for QA/QCing: globalIRmap repository, see Getting Started and Workflow tab of this website for more information.

Map of gauging stations included or exluded from the analysis

Loading of the map can take several minutes, please wait and read the instructions below

What does this show?

On this map are shown all global gauging stations that were reliably matched to the RiverATLAS global digital river network.

What do the colors mean?

  • Blue-colored points show perennial stations (< 1 zero-flow day per year on average across the record)

  • Red-colored points show non-perennial stations (>= 1 zero-flow day per year on average across the record)

  • Semi-transparent points show stations which were kept and included for the training and testing of predictive models.

  • Opaque points show stations which were excluded from further analysis due to one of the various reasons explained in the previous sections.

How do I use this map?

  • To pan, hold the left button and move the mouse.
  • To zoom in or out, use your mouse wheel (or the equivalent on a pad) or press the + and - symbols on the upper left of the map.
  • To add or remove a layer, hover your mouse over the stacked squares in the lower right of the map and click or unclick the corresponding check box. Any combination of blue-red and opaque-transparent points can be added or removed.
  • To get information on a station, hover with your mouse over the corresponding point:
    • the first line shows which dataset this record comes from (GRDC or GSIM) and the corresponding uniquer identifier for this station.
    • the second line shows whether this station was included in the analysis (“kept”), manually “inspected” for erroneous daily discharge values, or altogether “removed” from further analysis.
    • the third line gives a brief explanation as to why the station was removed.
  • To see a hydrograph for a station, left-click the corresponding point. This feature is only available for stations with at least 10 years of valid data. Two types of graphs may be displayed:
    • For a GRDC station: the graph shows the time series of daily streamflow values for that station, excluding calendar years with more than 20 missing daily records. The y-axis is square-root transformed. Individual points show daily values, blue lines link daily values (which may result in unusual patterns due to missing years), green points are non-zero flow daily values statistically flagged as potential outliers, red points are zero-flow flow values, black points are zero-flow values flagged as potential outliers.
    • For a GSIM station: daily streamflow records from GSIM stations are unavailable. Therefore, the graph shows the mean (blue points), ± 2SD (light blue background shading), minimum and maximum monthly discharge (black points), excluding calendar years with more than 20 missing daily record. The y-axis is square-root transformed. Red points show minimum monthly discharge values equal to 0, purple points show months for which all daily discharge values are equal to 0.

Source code for the app

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     rstudioapi_0.13  whisker_0.4      knitr_1.29      
 [5] magrittr_1.5     R6_2.4.1         rlang_0.4.10     fansi_0.4.1     
 [9] highr_0.9        stringr_1.4.0    tools_4.0.2      xfun_0.24       
[13] utf8_1.1.4       git2r_0.27.1     htmltools_0.5.0  ellipsis_0.3.2  
[17] rprojroot_1.3-2  yaml_2.2.1       digest_0.6.25    tibble_3.1.1    
[21] lifecycle_0.2.0  crayon_1.3.4     later_1.2.0      vctrs_0.3.8     
[25] promises_1.2.0.1 fs_1.5.0         glue_1.4.0       evaluate_0.14   
[29] rmarkdown_2.7    stringi_1.4.6    compiler_4.0.2   pillar_1.6.1    
[33] backports_1.1.10 httpuv_1.5.4     renv_0.9.3       pkgconfig_2.0.3