count_observations

Last updated: 2021-10-14

Checks: 7 passed, 0 failed

Knit directory: bgc_argo_r_argodata/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20211008)

The command set.seed(20211008) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 6a2a266

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 6a2a266. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  code/creating_dataframe.R
    Untracked:  code/creating_map.R

Unstaged changes:
    Modified:   analysis/_site.yml
    Modified:   code/Workflowr_project_managment.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/count_observations.Rmd) and HTML (docs/count_observations.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	5331669	pasqualina-vonlanthendinenna	2021-10-13	Build site.
Rmd	d8616e1	pasqualina-vonlanthendinenna	2021-10-13	added timeseries of all 3 bgc variables
html	795b5ad	pasqualina-vonlanthendinenna	2021-10-13	Build site.
Rmd	81f5ac9	pasqualina-vonlanthendinenna	2021-10-13	added timeseries of all 3 bgc variables
html	4840e49	pasqualina-vonlanthendinenna	2021-10-12	Build site.
Rmd	2fb35f7	pasqualina-vonlanthendinenna	2021-10-12	added reading data in page

Count the number of bgc-argo observations, and plot the evolution over time.

Use the same process as in loading_data.Rmd to load in bgc-argo data and the corresponding metadata:

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.3     ✓ dplyr   1.0.5
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

# remotes::install_github("ArgoCanada/argodata")
library(argodata)
library(ggplot2)
library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

argo_set_cache_dir('/nfs/kryo/work/updata/bgc_argo_r_argodata')

# max_global_cache_age and max_data_cache_age set the maximum age (in hours) a cached file may have before it is updated
# Inf means always use the cached file, and -Inf means always download from the server
# e.g. if max_global_cache_age = 5, then files older than 5 hours will be updated
argo_update_global(max_global_cache_age = Inf)
argo_update_data(max_data_cache_age = Inf)


# select the bgc-argo files containing delayed-mode data (recommended for bgc variables)
# between January 1, 2013 and December 31, 2015 (this picks the matching subset of the cached files)
bgc_subset = argo_global_synthetic_prof() %>%
  argo_filter_data_mode(data_mode = 'delayed') %>%
  argo_filter_date(date_min = '2013-01-01',
                   date_max = '2015-12-31')

Loading argo_global_synthetic_prof()

# check the dates 
# max(bgc_subset$date, na.rm = TRUE)
# min(bgc_subset$date, na.rm = TRUE)

bgc_data = argo_prof_levels(bgc_subset, 
                            vars = c('PRES_ADJUSTED','PRES_ADJUSTED_QC',
                                     'PSAL_ADJUSTED', 'PSAL_ADJUSTED_QC',
                                     'TEMP_ADJUSTED','TEMP_ADJUSTED_QC',
                                     'DOXY_ADJUSTED', 'DOXY_ADJUSTED_QC',
                                     'NITRATE_ADJUSTED', 'NITRATE_ADJUSTED_QC',
                                     'PH_IN_SITU_TOTAL_ADJUSTED', 'PH_IN_SITU_TOTAL_ADJUSTED_QC'), quiet = TRUE) 
# read in the profiles of the delayed-mode data from 01/01/2013 to 31/12/2015 (takes a while)

bgc_metadata = argo_prof_prof(bgc_subset) # read in the metadata corresponding to these profiles

Extracting from 54117 files

full_data = left_join(bgc_data, bgc_metadata, by = c('file', 'n_prof')) # join data and metadata together
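The join above attaches profile-level metadata to every depth level that shares the same file and profile index. A minimal sketch of that behaviour on invented toy data is given here; the toy tables, their values, and the `n_levels` column are illustrative assumptions, not output from argodata:

```r
library(dplyr)

# Toy stand-in for the level data: one row per depth level (values are invented)
toy_levels <- tibble(
  file          = c("a.nc", "a.nc", "a.nc", "b.nc"),
  n_prof        = c(1L, 1L, 2L, 1L),
  n_levels      = c(1L, 2L, 1L, 1L),
  doxy_adjusted = c(210.5, 198.2, NA, 250.1)
)

# Toy stand-in for the profile metadata: one row per profile
toy_prof <- tibble(
  file      = c("a.nc", "a.nc", "b.nc"),
  n_prof    = c(1L, 2L, 1L),
  latitude  = c(-45.2, -45.2, 10.0),
  longitude = c(12.1, 12.1, -30.5)
)

# Same join keys as in the analysis: file name and profile index
toy_full <- left_join(toy_levels, toy_prof, by = c("file", "n_prof"))

# The number of rows is unchanged; metadata is repeated across the levels of a profile
nrow(toy_full) == nrow(toy_levels)
```

Because each profile appears only once in the metadata table, the left join keeps exactly one row per depth level.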

Create a separate dataframe for each BGC variable (oxygen, pH and nitrate), with longitude, latitude, value, qc flag, date, time, year, month, day, cycle number, float ID, and profile qc flag, and look at the evolution of the number of observations (any depth levels) and the number of profiles (depth levels 0-2000 m) over time.

QC flags for values (‘flag’ column) are between 1 and 9, where: 1 is ‘good’ data, 2 is ‘probably good’ data, 3 is ‘probably bad’ data, 4 is ‘bad’ data, 5 is ‘value changed’, 6 is not used, 7 is not used, 8 is ‘estimated value’, 9 is ‘missing value’.

Profile QC flags (‘profile_flag’ column) are QC codes attributed to the entire profile, and indicate the percentage of depth levels where the value is considered to be good data (QC flags of 1, 2, 5, and 8):
‘A’ means 100% of profile levels contain good data, ‘B’ means 75-<100% of profile levels contain good data, ‘C’ means 50-<75% of profile levels contain good data, ‘D’ means 25-<50% of profile levels contain good data, ‘E’ means >0-<25% of profile levels contain good data, ‘F’ means 0% of profile levels contain good data.
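To make the relation between value flags and profile flags concrete, the following sketch derives a profile letter from a vector of value QC flags. Both helper functions are hypothetical and not part of argodata. The sketch assumes non-overlapping classes in which each class includes its lower bound, with ‘E’ covering everything above 0% and below 25%, and for simplicity it counts every level, including levels flagged as missing:

```r
# Hypothetical helper: value QC flags 1, 2, 5 and 8 count as good data
is_good_flag <- function(flag) {
  flag %in% c(1, 2, 5, 8)
}

# Hypothetical helper: profile QC letter from the percentage of good levels
# (assumes each class includes its lower bound; all levels are counted)
profile_qc_letter <- function(flags) {
  pct_good <- 100 * mean(is_good_flag(flags))
  if (pct_good == 100) {
    "A"
  } else if (pct_good >= 75) {
    "B"
  } else if (pct_good >= 50) {
    "C"
  } else if (pct_good >= 25) {
    "D"
  } else if (pct_good > 0) {
    "E"
  } else {
    "F"
  }
}

profile_qc_letter(c(1, 1, 2, 8))  # all four levels good: "A"
profile_qc_letter(c(1, 4, 4, 4))  # one of four levels good (25%): "D"
profile_qc_letter(c(4, 4, 9, 3))  # no good levels: "F"
```

In the analysis itself the profile flags are read directly from the metadata, so this computation is only meant to illustrate what the letters encode.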

NUMBER OF OBSERVATIONS

1.1 Oxygen

oxy = data.frame(full_data$longitude, full_data$latitude, full_data$date, full_data$doxy_adjusted, full_data$doxy_adjusted_qc) # extract desired variables from the combined data/metadata dataframe

colnames(oxy) = c('longitude','latitude','date', 'doxy_adjusted', 'flag') # add column names 
oxy$date.simple = as.Date(oxy$date)        # separate the date and time into two columns
oxy$time = format(oxy$date, '%H:%M:%S')

oxy = oxy %>%                             # separate the date into year, month and day
  mutate(year = year(date.simple),
         month = month(date.simple),
         day = day(date.simple),
         cycle = full_data$cycle_number,   
         float_ID = full_data$float_serial_no,
         profile_flag = full_data$profile_doxy_qc) # add cycle number, float ID and profile qc flag to the dataframe 

oxy.no.na = oxy %>%
  filter(!is.na(doxy_adjusted)) # remove NA values

num_obs_oxy = oxy.no.na %>%
  group_by(year, month, flag) %>%     
  summarise(Count = n()) # count the number of oxygen observations by month

`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

The number of observations can then be plotted as a timeseries. Different colored lines are plotted for different quality control flags.

# plot the number of oxygen observations per month 
ggplot(num_obs_oxy, aes(x = month, y = Count, group = flag, col = flag)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ year) +
  labs(title = 'number of adjusted oxygen observations',
       y = 'number of observations',
       x = 'month') +
  theme_bw()

Version	Author	Date
4840e49	pasqualina-vonlanthendinenna	2021-10-12
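The counting steps of this section can also be written as a single pipeline. The sketch below runs on invented toy data (the `toy_full` table and its values are assumptions for illustration) and sets the `.groups` argument mentioned in the `summarise()` message, so that the result is returned ungrouped:

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the joined data/metadata (values are invented)
toy_full <- tibble(
  date = as.POSIXct(c("2013-01-05 10:00:00", "2013-01-20 11:30:00",
                      "2013-02-02 09:15:00", "2013-02-03 09:15:00"),
                    tz = "UTC"),
  doxy_adjusted    = c(210.5, NA, 198.2, 250.1),
  doxy_adjusted_qc = c("1", "4", "1", "2")
)

toy_num_obs <- toy_full %>%
  filter(!is.na(doxy_adjusted)) %>%                # drop levels without an oxygen value
  mutate(year = year(date), month = month(date)) %>%
  group_by(year, month, flag = doxy_adjusted_qc) %>%
  summarise(Count = n(), .groups = "drop")         # ungrouped result, no grouping message

toy_num_obs
```

Dropping the grouping explicitly avoids surprises when the summary table is reused in later grouped operations.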

1.2 pH

ph = data.frame(full_data$longitude, full_data$latitude, full_data$date, full_data$ph_in_situ_total_adjusted, full_data$ph_in_situ_total_adjusted_qc)
# extract the variables from the combined data/metadata frame

colnames(ph) = c('longitude','latitude','date', 'ph_in_situ_total', 'flag')
# rename columns 

ph$date.simple = as.Date(ph$date)
ph$time = format(ph$date, '%H:%M:%S') # separate date and time into two separate columns

ph = ph %>%
  mutate(year = year(date.simple),
         month = month(date.simple),
         day = day(date.simple),
         cycle = full_data$cycle_number,
         float_ID = full_data$float_serial_no,
         profile_flag = full_data$profile_ph_in_situ_total_qc) # separate year, month and day, and add cycle number, float ID and profile qc flag 

ph.no.na = ph %>%
  filter(!is.na(ph_in_situ_total)) # remove NA values 

num_obs_ph = ph.no.na %>%
  group_by(year, month, flag) %>%
  summarise(Count = n())   # count the number of observations per month

`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

We can then plot the number of pH observations per month as a timeseries. Different lines are plotted for different quality control flags.

ggplot(num_obs_ph, aes(x = month, y = Count, group = flag, col = flag)) +
  geom_line() +
  geom_point() +
  facet_wrap(~year) +
  labs(title = 'number of adjusted pH observations',
       y = 'number of observations',
       x = 'month') +
  theme_bw()

Version	Author	Date
4840e49	pasqualina-vonlanthendinenna	2021-10-12

1.3 Nitrate

nitrate = data.frame(full_data$longitude, full_data$latitude, full_data$date, full_data$nitrate_adjusted, full_data$nitrate_adjusted_qc) # extract nitrate from the combined data/metadata dataframe

colnames(nitrate) = c('longitude','latitude', 'date', 'nitrate_adjusted', 'flag')
# rename columns 

nitrate$date.simple = as.Date(nitrate$date)
nitrate$time = format(nitrate$date, '%H:%M:%S') # separate the date into date and time columns

nitrate = nitrate %>%
  mutate(year = year(date.simple),
         month = month(date.simple),
         day = day(date.simple),
         cycle = full_data$cycle_number,
         float_ID = full_data$float_serial_no,
         profile_flag = full_data$profile_nitrate_qc) # separate year, month, and day, and add cycle number, float ID, and the profile qc flag 

nitrate.no.na = nitrate %>%
  filter(!is.na(nitrate_adjusted)) # remove NA values

num_obs_nitrate = nitrate.no.na %>%
  group_by(year, month, flag) %>%
  summarise(Count = n())   # count the number of nitrate observations per month

`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

Plot the number of nitrate observations per month as a timeseries. Different lines are plotted for different QC flags.

ggplot(num_obs_nitrate, aes(x = month, y = Count, group = flag, col = flag))+
  geom_line()+
  geom_point()+
  facet_wrap(~ year)+
  labs(title = 'number of adjusted nitrate observations',
       y = 'number of observations',
       x = 'month') +
  theme_bw()

Version	Author	Date
4840e49	pasqualina-vonlanthendinenna	2021-10-12
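The plots in this part show month on the x-axis and one panel per year. As an alternative sketch, the year and month columns can be combined into a single date so that the counts form one continuous time axis. The toy counts and the `month_start` column below are assumptions for illustration only:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy monthly counts spanning a year boundary (values are invented)
toy_counts <- tibble(
  year  = c(2013, 2013, 2014, 2014),
  month = c(11, 12, 1, 2),
  flag  = "1",
  Count = c(120, 95, 143, 160)
) %>%
  mutate(month_start = make_date(year, month, 1))  # first day of each month

# One continuous time axis instead of one facet per year
p <- ggplot(toy_counts, aes(x = month_start, y = Count, col = flag)) +
  geom_line() +
  geom_point() +
  scale_x_date(date_labels = "%Y-%m") +
  labs(x = "month", y = "number of observations") +
  theme_bw()

p
```

Faceting by year, as done in the analysis, makes seasonal patterns easier to compare between years, while a continuous axis makes the long-term trend easier to follow.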

1.4 All BGC variables (oxygen, pH, and nitrate)

We can count the number of observations for which all three BGC variables exist. QC flags cannot be included since they are specific to one variable.

# create a dataframe which contains all three variables, with longitude, latitude, date, cycle number and float ID 
bgc_co_located = data.frame(full_data$longitude, full_data$latitude,
                            full_data$date,
                            full_data$cycle_number,
                            full_data$float_serial_no,
                            full_data$doxy_adjusted,
                            full_data$ph_in_situ_total_adjusted,
                            full_data$nitrate_adjusted)

colnames(bgc_co_located) = c('longitude', 'latitude',
                             'date',
                             'cycle',
                             'float_ID',
                             'doxy_adjusted',
                             'ph_in_situ_total_adjusted',
                             'nitrate_adjusted') # rename the columns 

bgc_co_located = bgc_co_located %>%     # change the date and time format
  mutate(date.simple = as.Date(date),
         time = format(date, '%H:%M:%S'),
         year = year(date.simple),
         month = month(date.simple),
         day = day(date.simple))

bgc_co_located.no.na = bgc_co_located %>%  # remove NA values for each variable
  filter(!is.na(doxy_adjusted)) %>%
  filter(!is.na(ph_in_situ_total_adjusted)) %>%
  filter(!is.na(nitrate_adjusted))
# removes rows of pH and nitrate for which there is no oxygen, rows of oxygen and nitrate for which there is no pH, and rows of oxygen and pH for which there is no nitrate 

# count the number of observations left for each year and month 
num_obs_bgc = bgc_co_located.no.na %>%
  group_by(year, month) %>%
  summarise(Count = n())

`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

We can then plot the evolution over time of the number of BGC observations containing all three variables.

ggplot(num_obs_bgc, aes(x = month, y = Count))+
  geom_line()+
  geom_point()+
  facet_wrap(~ year)+
  labs(title = 'number of adjusted BGC observations',
       y = 'number of observations',
       x = 'month') +
  theme_bw()

Version	Author	Date
795b5ad	pasqualina-vonlanthendinenna	2021-10-13
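The three chained filters used in this section keep only the rows where oxygen, pH, and nitrate are all present. As a sketch on invented toy data, the same selection can be expressed with `tidyr::drop_na()`; the toy values and the `bgc_vars` vector are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Toy data: only the first row has all three variables (values are invented)
toy <- tibble(
  doxy_adjusted             = c(210.5, NA, 198.2, 250.1),
  ph_in_situ_total_adjusted = c(8.05, 8.01, NA, 7.98),
  nitrate_adjusted          = c(12.3, 15.0, 9.8, NA)
)

bgc_vars <- c("doxy_adjusted", "ph_in_situ_total_adjusted", "nitrate_adjusted")

# Chained filters, as in the analysis
chained <- toy %>%
  filter(!is.na(doxy_adjusted)) %>%
  filter(!is.na(ph_in_situ_total_adjusted)) %>%
  filter(!is.na(nitrate_adjusted))

# Equivalent selection: drop rows with an NA in any of the three columns
dropped <- toy %>% drop_na(all_of(bgc_vars))

all.equal(as.data.frame(chained), as.data.frame(dropped))
```

Either form works; `drop_na()` is mainly convenient when the list of required variables is long or stored in a vector.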

NUMBER OF PROFILES

Using the dataframes created above for oxygen, pH, and nitrate, we can also look at the number of profiles over time for each variable. Each profile spans multiple depth levels between 0 and 2000 m.

2.1 Oxygen

prof_oxy = oxy.no.na %>%
  group_by(float_ID, cycle, profile_flag, year, month) %>%     
  summarise(num_obs = n())  # count the number of oxygen observations for each float and cycle

`summarise()` has grouped output by 'float_ID', 'cycle', 'profile_flag', 'year'. You can override using the `.groups` argument.

prof_oxy = prof_oxy %>%
  mutate(prof = rep(1, length(cycle)))   # repeat a vector of 1s for each individual cycle (one cycle corresponds to one profile)

num_prof_oxy = prof_oxy %>%
  group_by(year, month, profile_flag) %>%
  summarise(count_prof = n())       # count the number of 1s (profiles)

`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

Plot the evolution of the number of oxygen profiles over time

ggplot(num_prof_oxy, aes(x = month, y = count_prof, group = profile_flag, col = profile_flag)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ year, ncol = 3) +
  labs(title = 'number of oxygen profiles',
       y = 'number of profiles',
       x = 'month') +
  theme_bw()

Version	Author	Date
4840e49	pasqualina-vonlanthendinenna	2021-10-12
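Since `n()` counts rows, the profile count comes from the number of distinct float and cycle combinations rather than from the helper column of 1s. A compact sketch of the same idea on invented toy data, using `distinct()` followed by `count()`, is shown here; the toy table is an assumption for illustration:

```r
library(dplyr)

# Toy stand-in: one row per depth level, three profiles in total (values are invented)
toy_levels <- tibble(
  float_ID     = c("F1", "F1", "F1", "F1", "F2", "F2"),
  cycle        = c(1, 1, 2, 2, 1, 1),
  profile_flag = c("A", "A", "B", "B", "A", "A"),
  year         = 2013,
  month        = 1
)

toy_num_prof <- toy_levels %>%
  distinct(float_ID, cycle, profile_flag, year, month) %>%  # one row per profile
  count(year, month, profile_flag, name = "count_prof")     # profiles per month and flag

toy_num_prof
```

With this toy input there are two profiles flagged ‘A’ and one flagged ‘B’ in January 2013, regardless of how many depth levels each profile contains.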

2.2 pH

prof_ph = ph.no.na %>%
  group_by(float_ID, cycle, profile_flag, year, month) %>%     
  summarise(num_obs = n()) # count the number of ph observations by float and cycle

`summarise()` has grouped output by 'float_ID', 'cycle', 'profile_flag', 'year'. You can override using the `.groups` argument.

prof_ph = prof_ph %>%
  mutate(prof = rep(1, length(cycle))) # repeat a vector of 1s over the length of the cycles (one cycle corresponds to one pH profile)

num_prof_ph = prof_ph %>%
  group_by(year, month, profile_flag) %>%
  summarise(count_prof = n()) # count the number of 1s (profiles)

`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

Plot the number of pH profiles over time

ggplot(num_prof_ph, aes(x = month, y = count_prof, group = profile_flag, col = profile_flag)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ year, ncol = 3) +
  labs(title = 'number of pH profiles',
       y = 'number of profiles',
       x = 'month') +
  theme_bw()

Version	Author	Date
4840e49	pasqualina-vonlanthendinenna	2021-10-12

2.3 Nitrate

prof_nitrate = nitrate.no.na %>%
  group_by(float_ID, cycle, profile_flag, year, month) %>%  
  summarise(num_obs = n()) # count the number of nitrate observations by float and cycle

`summarise()` has grouped output by 'float_ID', 'cycle', 'profile_flag', 'year'. You can override using the `.groups` argument.

prof_nitrate = prof_nitrate %>%
  mutate(prof = rep(1, length(cycle))) # repeat a vector of 1s over the length of cycle numbers (one cycle is one profile)

num_prof_nitrate = prof_nitrate %>%
  group_by(year, month, profile_flag) %>%
  summarise(count_prof = n()) # count the number of 1s (profiles)

`summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.

Plot the number of nitrate profiles over time

ggplot(num_prof_nitrate, aes(x = month, y = count_prof, group = profile_flag, col = profile_flag)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ year, ncol = 3) +
  labs(title = 'number of nitrate profiles',
       y = 'number of profiles',
       x = 'month') +
  theme_bw()

Version	Author	Date
4840e49	pasqualina-vonlanthendinenna	2021-10-12

2.4 All BGC variables (oxygen, pH, and nitrate)

We can count the number of profiles which contain all three variables. Profile QC flags cannot be included since they are specific to one variable.

# count the number of profiles for which all three variables exist
prof_bgc = bgc_co_located.no.na %>% 
  group_by(float_ID, cycle, year, month) %>%
  summarise(num_obs = n()) # count the number of observations by float and cycle number

`summarise()` has grouped output by 'float_ID', 'cycle', 'year'. You can override using the `.groups` argument.

prof_bgc = prof_bgc %>%
  mutate(prof = rep(1, length(cycle))) # repeat a vector of 1s over the number of cycles

num_prof_bgc = prof_bgc %>%
  group_by(year, month) %>%
  summarise(count_prof = n())  # count the number of 1s

`summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Plot the number of BGC profiles for which all three variables exist over time

ggplot(num_prof_bgc, aes(x = month, y = count_prof)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ year) +
  labs(title = 'number of BGC profiles',
       y = 'number of profiles',
       x = 'month') +
  theme_bw()

Version	Author	Date
5331669	pasqualina-vonlanthendinenna	2021-10-13

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: openSUSE Leap 15.2

Matrix products: default
BLAS:   /usr/local/R-4.0.3/lib64/R/lib/libRblas.so
LAPACK: /usr/local/R-4.0.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.7.9     argodata_0.0.0.9000 forcats_0.5.0      
 [4] stringr_1.4.0       dplyr_1.0.5         purrr_0.3.4        
 [7] readr_1.4.0         tidyr_1.1.3         tibble_3.1.3       
[10] ggplot2_3.3.5       tidyverse_1.3.0     workflowr_1.6.2    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        prettyunits_1.1.1 assertthat_0.2.1  rprojroot_2.0.2  
 [5] digest_0.6.27     utf8_1.2.2        R6_2.5.1          cellranger_1.1.0 
 [9] backports_1.1.10  reprex_0.3.0      evaluate_0.14     highr_0.8        
[13] httr_1.4.2        pillar_1.6.2      progress_1.2.2    rlang_0.4.11     
[17] readxl_1.3.1      rstudioapi_0.13   whisker_0.4       jquerylib_0.1.4  
[21] blob_1.2.1        rmarkdown_2.10    labeling_0.4.2    bit_4.0.4        
[25] munsell_0.5.0     broom_0.7.9       compiler_4.0.3    httpuv_1.6.2     
[29] modelr_0.1.8      xfun_0.25         pkgconfig_2.0.3   htmltools_0.5.1.1
[33] tidyselect_1.1.0  fansi_0.5.0       tzdb_0.1.2        crayon_1.4.1     
[37] dbplyr_1.4.4      withr_2.4.2       later_1.3.0       grid_4.0.3       
[41] jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.0   DBI_1.1.1        
[45] git2r_0.27.1      magrittr_2.0.1    scales_1.1.1      vroom_1.5.5      
[49] cli_3.0.1         stringi_1.5.3     farver_2.1.0      fs_1.5.0         
[53] promises_1.2.0.1  xml2_1.3.2        bslib_0.2.5.1     ellipsis_0.3.2   
[57] generics_0.1.0    vctrs_0.3.8       tools_4.0.3       bit64_4.0.5      
[61] glue_1.4.2        RNetCDF_2.4-2     hms_0.5.3         parallel_4.0.3   
[65] yaml_2.2.1        colorspace_2.0-2  rvest_0.3.6       knitr_1.33       
[69] haven_2.3.1       sass_0.4.0

Pasqualina Vonlanthen & Jens Daniel Müller

14 October, 2021