Last updated: 2021-02-23

Checks: 7 0

Knit directory: globalIRmap/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200414)

The command set.seed(20200414) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 0919010

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0919010. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    R/.Rhistory
    Ignored:    analysis/.Rhistory
    Ignored:    renv/library/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  .Rbuildignore
    Untracked:  .drake/
    Untracked:  .gitignore
    Untracked:  Compare_models_20201026.Rmd
    Untracked:  Rplots.pdf
    Untracked:  figtabres.docx
    Untracked:  figtabres_20201220_1.docx
    Untracked:  figtabres_20210216.docx
    Untracked:  log/
    Untracked:  schema.ini
    Untracked:  tabs_quick.Rmd
    Untracked:  tabs_quick.docx
    Untracked:  tabs_quick.html
    Untracked:  test.html

Unstaged changes:
    Modified:   analysis/about.Rmd
    Modified:   interactive.R
    Modified:   shinyapp/globalIRmap_gaugesel/app.R
    Modified:   shinyapp/globalIRmap_gaugesel/rsconnect/shinyapps.io/messamat/globalIRmap_gaugesel.dcf
    Modified:   shinyapp/globalIRmap_gaugesel/www/shinygdat.fst
    Modified:   workflowr_commands.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/methods_refdisdat.Rmd) and HTML (docs/methods_refdisdat.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	0919010	messamat	2021-02-23	wflow_publish(“analysis/methods_refdisdat.Rmd”)
html	59c783a	messamat	2021-01-08	Build site.
Rmd	10e5aa9	messamat	2021-01-08	wflow_publish(“analysis/methods_refdisdat.Rmd”)
html	844dd6b	messamat	2021-01-08	Build site.
Rmd	5d228a8	messamat	2021-01-08	wflow_publish(“analysis/methods_refdisdat.Rmd”)
html	8a77149	messamat	2021-01-08	Build site.
Rmd	97a066a	messamat	2021-01-08	wflow_publish(“analysis/methods_refdisdat.Rmd”)
html	47239d1	messamat	2021-01-08	Build site.
Rmd	0b9b22a	messamat	2021-01-08	wflow_publish(“analysis/methods_refdisdat.Rmd”)
html	24c68da	messamat	2021-01-08	Build site.
Rmd	3647cde	messamat	2021-01-08	wflow_publish(“analysis/methods_refdisdat.Rmd”)
Rmd	1da3ce4	messamat	2021-01-08	Continue developing app
html	f7011f5	messamat	2021-01-06	Build site.
Rmd	bb0df37	messamat	2021-01-06	Start formatting gauge selection for better display
html	9b48bde	messamat	2021-01-06	Build site.
Rmd	f1d9dcf	messamat	2021-01-06	Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation
html	f1d9dcf	messamat	2021-01-06	Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation

Source datasets

Two streamflow gauging station datasets were used as the source of training and cross-validation data for study models — the World Meteorological Organization Global Runoff Data Centre (GRDC) database (n ≈ 10,000) and a complementary subset of the Global Streamflow Indices and Metadata archive (GSIM, n ≈ 25,000), a compilation of twelve free-to-access national and international streamflow gauging station databases.
Whereas the GRDC offers daily water discharge values for most stations, GSIM only contains time series summary indices computed at the yearly, seasonal and monthly resolution (calculated from daily records whose open-access release is restricted for some of the compiled data sources). Therefore, we used the GRDC database as the core of our training/testing set and complemented it with a subset of streamflow gauging stations from GSIM.

A GSIM station was included only if:

it was not already part of the GRDC database, and
it included auxiliary information on the drainage area of the monitored reach (for reliably associating it to RiverATLAS), and
had a reported drainage area ≥ 5 km2 or a reported mean annual discharge ≥ 0.01 m3 s-1 (as RiverATLAS only includes rivers with a drainage area ≥ 10 km2 or mean annual discharge ≥ 0.1 m3 s-1 to which we added a margin of error), and
either:
- had a drainage area < 100 km2 or,
- was located on an IRES or,
- was located in a river basin which did not already contain a selected GRDC station (based on HydroBASINS Pfafstetter level 5 sub-basins, average area globally = 2.9 x 10^4 km2).

Processing and selecting gauging stations data

Assignment of river network location

We first linked the Global Runoff Data Center (GRDC) hydrometric dataset to the RiverATLAS global river network. We then complemented the GRDC dataset with a subset of Global Streamflow Indices and Metadata Archive (GSIM) gauging stations (Do et al., 2018; Gudmundsson et al., 2018).

The linkage of GRDC stations to RiverATLAS followed a three-step process: i) The stations were associated with the HydroSHEDS flow accumulation grids (Lehner et al., 2008) based on the procedure documented in GRDC report number 41 (Lehner, 2012): each station was automatically linked to the location within five kilometres around the original location reported in the GRDC database that optimized the agreement between the reported drainage area in the GRDC database and the modelled drainage area derived from HydroSHEDS, while limiting the distance from its original location. ii) Following this placement, only stations whose HydroSHEDS drainage area diverged by less than 5% from the reported GRDC areas were retained for subsequent steps. Each station was then associated with a river reach in the RiverATLAS river network (vector-based): - Each station was first associated with the nearest river reach (from the location determined in the previous step). - If the drainage area at the pour point of the reach differed by more than 10% from the reported drainage area in the GRDC database, we manually inspected, and corrected if appropriate, the location of the station (next step). iii) Manual inspection involved verifying that the reach a station had been associated with in the RiverATLAS digital river network corresponded to the actual waterway the station was located on, based on topographic and high-resolution satellite imagery (ESRI ArcGIS basemaps). If we could not ascertain the position of the gauging station, the GRDC-reported river and station names were verified in close vicinity (~10 km), exploiting the fact that station names often originate from nearby settlements, roads, or other geographic features. If a station could not be verified within this vicinity, the search was extended to within 50-100 km. If still no location was found that matched the river and/or station name, the station name was queried in search engines and online maps to see whether a location with this name existed. In all cases, the final decision on whether a station was moved to a new and “reliable” location depended on whether at least two out of the following four indicators could be matched reasonably well: a) location on corresponding waterway based on satellite imagery or topographic map; b) river name; b) station name and c) drainage area (match between reported GRDC value and modeled RiverATLAS value). Some final decisions were subjective as difficult combinations could arise (e.g., multiple agreements yet also disagreement(s) in the different indices).

Of the 6,543 GRDC stations with point coordinates that had daily streamflow data post-1961 (as of July 2014), 2,001 were removed in step (ii) and 25 were removed in step (iii), yielding a set of 4,517 stations (including 225 stations whose position was manually adjusted) that could be reliably matched with a river reach in RiverATLAS for subsequent analysis. Following this spatial pre-processing/filtering, we removed GRDC stations with less than 10 years of daily discharge data (excluding years with more than 20 missing days), yielding a final dataset of 3,748 stations.

We applied a similar process of spatial pre-processing to an initial subset of 4,080 GSIM stations (out of 30,959) which: - were not already part of the GRDC database, and - included auxiliary information on the size of the drainage area associated with the station, and - had at least 10 years of daily discharge data (excluding years with more than 20 missing days), and - had a reported drainage area ≥ 5 km2 or a reported mean annual discharge ≥ 0.01 m3 s-1 (as RiverATLAS only includes rivers with a drainage area ≥ 10 km2 or mean annual discharge ≥ 0.1 m3 s-1 to which we added a margin of error), and - either: o had a drainage area < 100 km2 or, o were located on an IRES or, o were located in a river basin which did not already contain a selected GRDC station (based on HydroBASINS Pfafstetter level 5 sub-basins, average area globally = 2.9 x 104 km2).

All GSIM stations had already been associated with the HydroSHEDS flow accumulation grids by Do et al. (2018) following the same procedure documented in Lehner (2012) and outlined above for GRDC stations (step i). Therefore, we directly associated all GSIM stations with a river reach in the RiverATLAS river network following the same approach as described for GRDC stations above (step ii). Given the diversity of original data sources compiled in the GSIM database, an additional level of caution was applied in linking GSIM stations to the network so that every station was manually inspected following the procedure described above for GRDC stations (step iii). We manually modified the position of 1,736 GSIM stations and removed 791, keeping 3,289 stations for subsequent analysis.

Following this spatial pre-processing, we also removed 179 stations located on the same RiverATLAS river reach as another station (keeping the station with the smallest difference in reported drainage area compared to that computed in RiverATLAS for the reach’s pour point). There was no instance of stations with different flow intermittence classes being associated with the same RiverATLAS reach. We also removed 719 stations whose degree of flow regulation (DOR) by upstream reservoirs (Lehner et al., 2011) exceeded 50%. The resulting dataset contained 6,162 stations.

Quality-checking of discharge information

A custom procedure was developed to ensure the quality of the streamflow time series (rather than the spatial location) associated with the gauging stations. The focus of this quality-checking procedure was to ensure the validity of zero-flow readings and the flow intermittence class assigned to each gauge (i.e., perennial or non-perennial). Zero-flow readings at streamflow gauging stations can stem from multiple circumstances. Usually, these readings reflect true river drying due to various natural or anthropogenic processes. However, river freezing, flow reversal (e.g., due to tidal influence), instrument malfunctioning, and data entry or processing errors are also common events that result in zero-flow readings in spite of the continued flow of water in the channel (Zimmer et al., 2020). Reported time series data may contain ambiguity between zero-flow and no-data entries, leading to potential underestimation of zero-flow (if masked as no-data). In addition, river diversions and reservoirs associated with dams can modify the flow intermittence of a monitored river reach from perennial to non-perennial (e.g., interrupting water flow as a single event during the initial filling of the reservoir; or periodic or permanent dewatering of the downstream channel due to water diversion) and vice versa (e.g., keeping a constant flow of water for hydroelectricity production). Ideally, each streamflow record would be accompanied by detailed information and flags describing the quality of individual daily values. However, this information is typically unavailable or difficult to access. Notably, the GRDC has stopped providing data quality flags in recent updates, and both the GRDC and the European Water Association (EWA) recommend users not to rely on existing quality flags (Gudmundsson et al., 2018).

For GSIM stations, a statistical quality-checking procedure was already performed by Gudmundsson et al. (2018) to flag suspect daily values and remove them prior to computing hydrological indices. For databases that provided reliable quality (QA/QC) flags, all flagging typology were translated to a common framework (see Table 1 in Gudmundsson et al., 2018) and suspect values were removed. For station records originating from databases that did not provide quality flags (or that recommended not to use them, i.e., EWA and GRDC), a statistical procedure was applied by Gudmundsson et al. (2018) to flag and remove suspect values:

Days with negative recorded discharge were flagged.
Daily values with more than 10 consecutive equal discharge values larger than zero were flagged. This rule is motivated by the fact that many days with consecutive streamflow values often occur due to instrument failure (e.g., damaged sensors, ice jams) or flow regulations.
Daily streamflow values (Q) were flagged if values of log (Q+0.01) are larger or smaller than the mean value of log (Q+0.01) plus or minus 6 times the standard deviation of log (Q+0.01) computed for that calendar day for the entire length of the series. The mean and standard deviation are computed for a 5-day window centred on the calendar day to ensure that enough data are considered.

See Gudmundsson et al. (2018), Gudmundsson & Seneviratne (2016), and Klein Tank et al. (2009) for the rationale behind these criteria. We used the same criteria to automatically flag individual daily streamflow values in the streamflow records of the GRDC gauging stations. However, rather than directly removing flagged daily streamflow values as was done for GSIM by Gudmundsson et al. (2018), the values that were flagged as being suspect were identified as such for subsequent visual inspection of GRDC streamflow records. Prior to visual inspection, we removed all stations for which only integer streamflow values were available, as any daily discharge value < 0.5 m3 s-1 is reported as zero by the data provider at these stations.

We visually inspected plots of the discharge time series for both GSIM and GRDC stations. For GRDC stations, we inspected time series of daily streamflow values with flagged suspect values. For GSIM stations, we inspected plots depicting the mean (± 2 SD), minimum and maximum monthly discharge, as daily streamflow records from GSIM stations are unavailable. In addition, as we did not have access to the daily streamflow records used in producing GSIM, none of the values flagged and removed by Gudmundsson et al. 2018 were available for our visual inspection. If unable to determine the validity of the record with reasonable confidence (often a subjective decision given the lack of auxiliary information), we erred on the side of caution, deemed the gauging station as unreliable, and excluded it from subsequent analysis.

When a station exhibited a flow regime that we suspected was caused by a flow regulating structure (e.g., a dam or reservoir), we inspected satellite and topographic imagery for the presence of a regulating structure upstream of the station and excluded the station if one was present. Indicators of flow regulation included abrupt changes in seasonality or decreases in the magnitude of peak or low flows, or signs of hydropeaking (i.e., short duration, high flow events at regular intervals), sometimes following a temporary dip in discharge (due to reservoir filling). See map below documenting every station that was removed, the reason for its removal, and associated time-series plots [interactive map on website].

Due to a pre-processing artefact in the production of GSIM by Gudmundsson et al. (2018), daily streamflow values for stations located in the U.S. had been rounded to two decimals, leading to very low discharge values (< 0.005 m3 s-1) being rounded to 0. Therefore, we made sure of the validity of zero-flow values for all U.S. stations which, according to GSIM records, had at least one zero-flow day per year on average (i.e., would be considered non-perennial in the subsequent analysis): we downloaded and computed flow intermittence statistics directly on the original daily discharge data from the United States Geological Survey (USGS). All stations with ≥ 1 zero-flow day per year according to GSIM but < 1 zero-flow day per year according to USGS data were excluded from further analysis.

An additional level of caution was used for stations on river reaches undergoing flow cessation exclusively in winter and for stations in the vicinity of a marine coastline, as instrument freeze-up and tidal flow reversal are both documented sources of anomalous zero-flow values (Zimmer et al., 2020). “Winter-only” non-perennial gauging stations were defined as those whose stream record contained less than one zero-flow day per year on average during months with long-term mean air temperature over 10°C (averaged across the local catchment immediately draining to the river reach, according to WorldClim 2). In other words, “winter-only” non-perennial gauging stations were those which would not have qualified as non-perennial according to our criterion if only non-winter months were taken into account. “Marine” stations were defined as those within 3 km of a coastline. For GSIM stations with visually suspect anomalous records (e.g., abrupt shift down to 0 m3 s-1 that may be driven by station freezing), we attempted to obtain original daily streamflow records from the original agencies whose data was used to produce GSIM if they were freely available online (e.g., from HYDAT in Canada or the USGS in the United States).

Following this statistical outlier detection and manual time series inspection, we conducted the rest of the analysis with 5,615 gauging stations for model training and cross-validation, with 3,967 perennial river reaches and 1,388 non-perennial reaches, with 45 and 34 years of daily streamflow data on average, respectively, across all continents (except Antarctica, see Extended Data Fig. 8 in Main Text for details on their environmental distribution).

Map of gauging stations included or exluded from the analysis

Loading of the map can take several minutes, please wait and read the instructions below

What does this show?

On this map are shown all global gauging stations that were reliably matched to the RiverATLAS global digital river network.

What do the colors mean?

Blue-colored points show perennial stations (< 1 zero-flow day per year on average across the record)
Red-colored points show non-perennial stations (>= 1 zero-flow day per year on average across the record)
Semi-transparent points show stations which were kept and included for the training and testing of predictive models.
Opaque points show stations which were excluded from further analysis due to one of the various reasons explained in the previous sections.

How do I use this map?

To pan, hold the left button and move the mouse.
To zoom in or out, use your mouse wheel (or the equivalent on a pad) or press the + and - symbols on the upper left of the map.
To add or remove a layer, hover your mouse over the stacked squares in the lower right of the map and click or unclick the corresponding check box. Any combination of blue-red and opaque-transparent points can be added or removed.
To get information on a station, hover with your mouse over the corresponding point:
- the first line shows which dataset this record comes from (GRDC or GSIM) and the corresponding uniquer identifier for this station.
- the second line shows whether this station was included in the analysis (“kept”), manually “inspected” for erroneous daily discharge values, or altogether “removed” from further analysis.
- the third line gives a brief explanation as to why the station was removed.
To see a hydrograph for a station, left-click the corresponding point. This feature is only available for stations with at least 10 years of valid data. Two types of graphs may be displayed:
- For a GRDC station: the graph shows the time series of daily streamflow values for that station, excluding calendar years with more than 20 missing daily records. The y-axis is square-root transformed. Individual points show daily values, blue lines link daily values (which may result in unusual patterns due to missing years), green points are non-zero flow daily values statistically flagged as potential outliers, red points are zero-flow flow values, black points are zero-flow values flagged as potential outliers.
- For a GSIM station: daily streamflow records from GSIM stations are unavailable. Therefore, the graph shows the mean (blue points), ± 2SD (light blue background shading), minimum and maximum monthly discharge (black points), excluding calendar years with more than 20 missing daily record. The y-axis is square-root transformed. Red points show minimum monthly discharge values equal to 0, purple points show months for which all daily discharge values are equal to 0.

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     rstudioapi_0.11  whisker_0.4      knitr_1.29      
 [5] magrittr_1.5     R6_2.4.1         rlang_0.4.10     highr_0.8       
 [9] stringr_1.4.0    tools_4.0.2      xfun_0.17        git2r_0.27.1    
[13] htmltools_0.5.0  ellipsis_0.3.0   rprojroot_1.3-2  yaml_2.2.1      
[17] digest_0.6.25    tibble_3.0.1     lifecycle_0.2.0  crayon_1.3.4    
[21] later_1.0.0      vctrs_0.3.4      promises_1.1.0   fs_1.5.0        
[25] glue_1.4.0       evaluate_0.14    rmarkdown_2.6    stringi_1.4.6   
[29] compiler_4.0.2   pillar_1.4.4     backports_1.1.10 httpuv_1.5.4    
[33] renv_0.9.3       pkgconfig_2.0.3

Reference streamflow gauging stations