Last updated: 2019-07-09
This vignette will walk through the steps needed to create mean ‘whatever’ states during all of the MHWs detected in the previous SST preparation vignette. These ‘whatever’ states are any of the abiotic variables present in the NAPA model that have been deemed relevant with respect to the forcing of extreme ocean surface temperatures.
# Packages used in this vignette
library(jsonlite, lib.loc = "../R-packages/")
library(tidyverse) # Base suite of functions
library(ncdf4) # For opening and working with NetCDF files
library(lubridate) # For convenient date manipulation
# Load functions required below
source("code/functions.R")
# Set number of cores
doMC::registerDoMC(cores = 50)
# Disable scientific notation for numeric values
# I just find it annoying
options(scipen = 999)
# Corners of the study area
NWA_corners <- readRDS("data/NWA_corners.Rda")
# The NAPA data location
NAPA_files <- dir("../../data/NAPA025/1d_grid_T_2D", full.names = T)
# The NAPA model lon/lat values
NAPA_coords <- readRDS("data/NAPA_coords.Rda")
# Load NAPA bathymetry/lon/lat
NAPA_bathy <- readRDS("data/NAPA_bathy.Rda")
# Load MHW results
NAPA_MHW_sub <- readRDS("data/NAPA_MHW_sub.Rda")
# MHW events: detect_event() output nests climatology and event dataframes
# alternately, so keep every second row to retain only the events
NAPA_MHW_event <- NAPA_MHW_sub %>%
select(-clims, -cats) %>%
unnest(events) %>%
filter(row_number() %% 2 == 0) %>%
unnest(events)
For the upcoming variable prep we are also going to want only the NAPA coordinates that fall within our chosen study area, as defined by NWA_corners. We will also create a subsetted bathymetry file to serve as an effective land mask. This will save a lot of computational cost as we go along, because many of the pixels over land are given 0 values rather than missing values, which is a strange choice…
# The NAPA coordinates for the study area only
NAPA_coords_sub <- NAPA_coords %>%
filter(lon >= NWA_corners[1], lon <= NWA_corners[2],
lat >= NWA_corners[3], lat <= NWA_corners[4])
# saveRDS(NAPA_coords_sub, "data/NAPA_coords_sub.Rda")
# The NAPA bathymetry for the study area only
NAPA_bathy_sub <- NAPA_bathy %>%
right_join(NAPA_coords_sub, by = c("lon_index", "lat_index", "lon", "lat")) %>%
na.omit()
# saveRDS(NAPA_bathy_sub, "data/NAPA_bathy_sub.Rda")
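To confirm that the land mask behaves as expected, we could plot the subsetted bathymetry. This is a minimal sketch, assuming the bathy column holds depth values; points are used rather than a raster because the NAPA grid is curvilinear.
# Hypothetical sanity check of the land mask
ggplot(NAPA_bathy_sub, aes(x = lon, y = lat)) +
  geom_point(aes(colour = bathy), size = 0.1) +
  coord_cartesian(xlim = NWA_corners[1:2],
                  ylim = NWA_corners[3:4])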
There are many variables present in the NAPA model, more than we would really need to use for this project. We have therefore chosen to narrow our investigation. Listed below are all of the variables found within a given NAPA surface layer NetCDF file.
# Spreadsheet of variables present in the NetCDF files
NAPA_vars <- ncdump::NetCDF(NAPA_files[1])$variable[1:6]
NAPA_vars
# A tibble: 20 x 6
name ndims natts prec units longname
<chr> <int> <int> <chr> <chr> <chr>
1 emp_ice 3 10 float kg/m2/s Evap minus Precip ove…
2 emp_oce 3 10 float kg/m2/s Evap minus Precip ove…
3 fmmflx 3 9 float kg/m2/s Water flux due to fre…
4 mldkz5 3 10 float m Turbocline depth (Kz …
5 mldr10_1 3 10 float m Mixed Layer Depth (ds…
6 nav_lat 2 3 float degrees_north Latitude
7 nav_lon 2 3 float degrees_east Longitude
8 qemp_oce 3 9 float W/m2 Downward Heat Flux fr…
9 qla_oce 3 10 float W/m2 Latent Downward Heat …
10 qns 3 9 float W/m2 non solar Downward He…
11 qsb_oce 3 10 float W/m2 Sensible Downward Hea…
12 qt 3 10 float W/m2 Net Downward Heat Flux
13 runoffs 3 10 float kg/m2/s Water Flux into Sea W…
14 ssh 3 10 float m sea surface height
15 sss 3 10 float 0.001 Sea Surface Salinity
16 sst 3 10 float degree_C Sea Surface Temperatu…
17 taum 3 10 float N/m2 wind stress module
18 time_center… 1 6 double seconds since 1… Time axis
19 time_center… 2 0 double "" time_centered_bounds
20 time_counte… 2 0 double "" time_counter_bounds
These variables are not an unruly amount of information, so we will extract and compile most of them when creating the data packets that will be fed into our SOMs later. Many of these variables are likely to be highly auto-correlated, in which case we will need to select only the most representative of them. I foresee the multiple heat terms coming to a head with one another, though I can’t yet say from this early vantage point which will be best.
It should go without saying, but we will not be accounting for lon/lat or the three time variables at the end of the above spreadsheet, as these are not proper abiotic variables. Also, upon closer examination it was discovered that the heat terms for sensible (qsb_oce) and latent (qla_oce) heat flux into the ocean do not contain any data, so they cannot be used here. Lastly, we only want to use variables that are present in all of the regions. This means that river runoff (runoffs) is not relevant to this work, as its localised effects will be reflected in changes to sss, which is present in all regions, whereas river runoff is not. We will also exclude evaporation/precipitation over ice (emp_ice), as this variable is only present in the northern regions. That being said, we will keep water flux due to freezing/melting ice (fmmflx), even though this variable does not have data in all regions, because we suspect it may play a role in the formation of MHWs, specifically as they pertain to phenological shifts in ice formation.
Therefore our initial variable list is as follows:
# Remove unwanted variables
NAPA_vars <- ncdump::NetCDF(NAPA_files[1])$variable[c(2:5, 8, 10, 12, 14:17), 1:6]
# Save
# saveRDS(NAPA_vars, "data/NAPA_vars.Rda")
NAPA_vars
# A tibble: 11 x 6
name ndims natts prec units longname
<chr> <int> <int> <chr> <chr> <chr>
1 emp_oce 3 10 float kg/m2/s Evap minus Precip over ocean
2 fmmflx 3 9 float kg/m2/s Water flux due to freezing/melting
3 mldkz5 3 10 float m Turbocline depth (Kz = 5e-4)
4 mldr10_1 3 10 float m Mixed Layer Depth (dsigma = 0.01 wr…
5 qemp_oce 3 9 float W/m2 Downward Heat Flux from E-P over op…
6 qns 3 9 float W/m2 non solar Downward Heat Flux
7 qt 3 10 float W/m2 Net Downward Heat Flux
8 ssh 3 10 float m sea surface height
9 sss 3 10 float 0.001 Sea Surface Salinity
10 sst 3 10 float degree… Sea Surface Temperature
11 taum 3 10 float N/m2 wind stress module
With the variables chosen, the next step is to create mean synoptic states for each variable during each of the MHWs detected in all sub-regions. To make that process go more smoothly we will first create a date index of all of the NAPA files present on Eric Oliver’s local server. Unfortunately that means that from here on out the code in this vignette will only run on said server. The output of this vignette will, however, be publicly available here.
To create the index of dates found within each of the thousands of NAPA surface NetCDF files, we will use a simple for loop that crawls through the files and records, in one long spreadsheet, which dates are found in which files. While this could be done on the fly in the following steps, it is easier to have a stable index prepared.
# Pull out the dates; the start/end dates sit at fixed positions in each file name
NAPA_files_dates <- data.frame()
for(i in 1:length(NAPA_files)){
file_name <- NAPA_files[i]
date_start <- ymd(str_sub(basename(as.character(file_name)), start = 29, end = 36))
date_end <- ymd(str_sub(basename(as.character(file_name)), start = 38, end = 45))
date_seq <- seq(date_start, date_end, by = "day")
date_info <- data.frame(file = file_name, date = date_seq)
NAPA_files_dates <- rbind(date_info, NAPA_files_dates)
}
# Order by date, just for tidiness
NAPA_files_dates <- dplyr::arrange(NAPA_files_dates, date)
# Save
# saveRDS(NAPA_files_dates, "data/NAPA_files_dates.Rda")
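With the index saved, finding the file that contains any given day becomes a simple filter. The date below is just an arbitrary example:
# Which file holds a given date?
filter(NAPA_files_dates, date == as.Date("2010-06-15"))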
Part of the data packet we need to create for the SOMs is the anomaly values. In order to create anomalies, however, we must first create climatologies for all of the variables. This may prove to be a somewhat daunting task, but it’s what we are here to do! To create a climatology of values we will need to load all of the files and then go about getting the seasonal (daily) climatologies pixel-wise. This will be done with the same function (ts2clm()) that is used for the MHW climatologies. We will first create a function that extracts the desired variables from any NetCDF files fed to it. With that done it should be a routine matter to get the climatologies. Hold onto your hats, this is going to be RAM heavy…
# Load functions required below
source("code/functions.R")
# NB: The creation of a clim for one variable is too large to run via ldply
# Rather, they must be run one at a time via a for loop, with the memory dumped after each
for(i in 1:nrow(NAPA_vars)){
clim_one_var(NAPA_vars$name[i])
gc()
}
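The heavy lifting here is done by clim_one_var(), which is sourced from code/functions.R and not shown above. Below is a minimal sketch of what it could look like; the column handling, the reuse of extract_one_var(), and the climatology period are all assumptions rather than the project’s actual implementation.
# Hypothetical sketch of clim_one_var(); the real function lives in code/functions.R
clim_one_var <- function(var_name){
  # Load the chosen variable from every NetCDF file
  var_all <- plyr::ldply(NAPA_files, .parallel = TRUE,
                         function(f) extract_one_var(var_name, file_name = f))
  # Pixel-wise daily climatology with the same function used for the MHWs
  var_clim <- var_all %>%
    rename(temp = !!rlang::sym(var_name)) %>% # ts2clm() expects `t` and `temp`
    select(lon, lat, t, temp) %>%
    group_by(lon, lat) %>%
    nest() %>%
    mutate(clim = purrr::map(data, heatwaveR::ts2clm,
                             climatologyPeriod = range(var_all$t),
                             clmOnly = TRUE)) %>%
    select(-data) %>%
    unnest(clim) %>%
    select(lon, lat, doy, seas) %>%
    rename(!!var_name := seas)
  # Save one climatology file per variable, matching the file names loaded below
  saveRDS(var_clim, paste0("data/NAPA_clim_", var_name, ".Rda"))
}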
With that large hurdle jumped, let’s double down and join all of these data together for ease of loading in the future.
# Load all variable climatologies and join variables with a for loop
# NB: This should be optimised...
NAPA_clim_vars <- data.frame()
# system.time(
for(i in 1:length(NAPA_vars$name)){
var_one <- readRDS(file = paste0("data/NAPA_clim_",NAPA_vars$name[i],".Rda"))
if(nrow(NAPA_clim_vars) == 0){
NAPA_clim_vars <- rbind(var_one, NAPA_clim_vars)
} else {
NAPA_clim_vars <- left_join(NAPA_clim_vars, var_one,
by = c("lon", "lat", "doy"))
}
}
# ) # 115 seconds for all
rm(var_one, i); gc()
# Convert DOY to MM-DD for joining to daily data below
# The 2015-12-31 origin places days in leap-year 2016, so doy 366 maps to a real date
NAPA_clim_vars$doy <- format(as.Date(NAPA_clim_vars$doy, origin = "2015-12-31"), "%m-%d")
# Change column names to highlight that these are climatology values
colnames(NAPA_clim_vars)[-c(1:3)] <- paste0(colnames(NAPA_clim_vars)[-c(1:3)],"_clim")
# Reorder columns
# NAPA_clim_vars <- dplyr::select(NAPA_clim_vars, lon, lat, doy, everything())
# saveRDS(NAPA_clim_vars, "data/NAPA_clim_vars.Rda")
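Before moving on, a quick check (a hypothetical sketch) that the join produced one row per pixel per calendar day, with one _clim column per chosen variable:
# Each pixel should now carry 366 calendar days of climatologies
NAPA_clim_vars %>%
  group_by(lon, lat) %>%
  summarise(n_days = n()) %>%
  head()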
We created the date index above so that we can easily load only the NetCDF files containing the dates we need for extracting our desired variables. The dates we want are the range of dates during each of the MHWs detected in the SST preparation vignette. In the chunk below we create a function that extracts all of the desired variables from a given NetCDF file, and a function that decides which files to load and binds everything up into tidy data packets that our SOM can ingest.
# Load NAPA file date index
NAPA_files_dates <- readRDS("data/NAPA_files_dates.Rda")
# Load full variable climatology file
NAPA_clim_vars <- readRDS("data/NAPA_clim_vars.Rda")
# Function for extracting the desired variables from a given NetCDF file
# testers...
# file_name <- NAPA_files[1]
extract_all_var <- function(file_name){
# Extract and join variables with a for loop
# NB: This should be optimised...
NAPA_vars_extracted <- data.frame()
system.time(
for(i in 1:length(NAPA_vars$name)){
extract_one <- extract_one_var(NAPA_vars$name[i], file_name = file_name)
if(nrow(NAPA_vars_extracted) == 0){
NAPA_vars_extracted <- rbind(extract_one, NAPA_vars_extracted)
} else {
NAPA_vars_extracted <- left_join(NAPA_vars_extracted, extract_one,
by = c("lon_index", "lat_index", "lon", "lat", "bathy", "t"))
}
}
) # 18 seconds for one
NAPA_vars_extracted <- dplyr::select(NAPA_vars_extracted,
lon_index, lat_index, lon, lat, t, bathy, everything())
# Exit
return(NAPA_vars_extracted)
}
# Function for extracting variables from as many files as a MHW event requires
# testers...
# event_sub <- NAPA_MHW_event[23,]
data_packet <- function(event_sub){
# Create date and file index for loading
date_idx <- seq(event_sub$date_start, event_sub$date_end, by = "day")
file_idx <- filter(NAPA_files_dates, date %in% date_idx) %>%
mutate(file = as.character(file)) %>%
select(file) %>%
unique()
# Load required base data
# system.time(
packet_base <- plyr::ldply(file_idx$file, extract_all_var) %>%
filter(t %in% date_idx) %>%
mutate(doy = format(t, "%m-%d"))
# ) # 125 seconds for seven files
# Join to climatologies
packet_join <- left_join(packet_base, NAPA_clim_vars, by = c("lon", "lat", "doy"))
# Create anomaly values and remove clim columns
packet_anom <- packet_join %>%
mutate(emp_oce_anom = emp_oce - emp_oce_clim,
fmmflx_anom = fmmflx - fmmflx_clim,
mldkz5_anom = mldkz5 - mldkz5_clim,
mldr10_1_anom = mldr10_1 - mldr10_1_clim,
qemp_oce_anom = qemp_oce - qemp_oce_clim,
qns_anom = qns - qns_clim,
qt_anom = qt - qt_clim,
ssh_anom = ssh - ssh_clim,
sss_anom = sss - sss_clim,
sst_anom = sst - sst_clim,
taum_anom = taum - taum_clim) %>%
dplyr::select(lon, lat, doy, emp_oce:taum_anom,
-c(colnames(NAPA_clim_vars)[-c(1:3)]))
# dplyr::select(-c(colnames(packet_base)[-c(3,4,ncol(packet_base))]),
# -c(colnames(NAPA_clim_vars)[-c(1:3)]))
# Create mean synoptic values
packet_mean <- packet_anom %>%
select(-doy) %>%
# NB: The lowest pixels are a forcing edge and shouldn't be included
# We can catch these out by filtering pixels whose SST is exactly 0
filter(sst != 0) %>%
group_by(lon, lat) %>%
summarise_all(mean, na.rm = T) %>%
arrange(lon, lat) %>%
ungroup() %>%
nest(.key = "synoptic")
# Combine results with MHW dataframe
packet_res <- cbind(event_sub, packet_mean)
# Test visuals
# ggplot(packet_mean, aes(x = lon, y = lat)) +
# geom_point(aes(colour = sst_anom)) +
# scale_colour_gradient2(low = "blue", high = "red") +
# coord_cartesian(xlim = NWA_corners[1:2],
# ylim = NWA_corners[3:4])
# Exit
return(packet_res)
}
With our functions sorted, it is now time to create our data packets.
# Set number of cores
# NB: Was set to 25 as someone else was using the server at the time
doMC::registerDoMC(cores = 25)
# Create one big packet
# system.time(
synoptic_states <- plyr::ddply(NAPA_MHW_event, c("region", "sub_region", "event_no"), data_packet, .parallel = T)
# ) # 82 seconds for the first 2; 6125 seconds (102 minutes) for all
# Save
# saveRDS(synoptic_states, "data/synoptic_states.Rda")
With all of the synoptic snapshots for our chosen variables created, it is now time to feed them to the self-organising map (SOM) analysis.
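The SOM fitting itself happens in the next vignette, but a minimal sketch of how these packets might be fed to one follows; the use of the kohonen package, the wide-matrix layout, and the grid size are all assumptions rather than the project’s actual setup.
# Hypothetical sketch: unnest the packets and train a small SOM
library(kohonen)
synoptic_states <- readRDS("data/synoptic_states.Rda")
# One row per event; spread each pixel/variable pair into its own column
som_wide <- synoptic_states %>%
  select(region, sub_region, event_no, synoptic) %>%
  unnest(synoptic) %>%
  gather(key = "var", value = "val",
         -region, -sub_region, -event_no, -lon, -lat) %>%
  unite(col = "pixel_var", lon, lat, var) %>%
  spread(key = pixel_var, value = val)
# Train the SOM on the anomaly fields
som_fit <- som(as.matrix(select(som_wide, -region, -sub_region, -event_no)),
               grid = somgrid(xdim = 4, ydim = 3, topo = "hexagonal"))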
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 FNN_1.1.2.1 akima_0.6-2
[4] tidync_0.2.1 heatwaveR_0.3.6.9004 data.table_1.11.6
[7] lubridate_1.7.4 ncdf4_1.16 forcats_0.3.0
[10] stringr_1.3.1 dplyr_0.7.6 purrr_0.2.5
[13] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2
[16] ggplot2_3.0.0 tidyverse_1.2.1 jsonlite_1.6
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 lattice_0.20-35 utf8_1.1.4
[4] assertthat_0.2.0 rprojroot_1.3-2 digest_0.6.16
[7] foreach_1.4.4 R6_2.2.2 cellranger_1.1.0
[10] plyr_1.8.4 backports_1.1.2 evaluate_0.11
[13] httr_1.3.1 pillar_1.3.0 rlang_0.2.2
[16] lazyeval_0.2.1 readxl_1.1.0 ncmeta_0.0.4
[19] rstudioapi_0.7 whisker_0.3-2 R.utils_2.7.0
[22] R.oo_1.22.0 rmarkdown_1.10 htmlwidgets_1.3
[25] munsell_0.5.0 broom_0.5.0 compiler_3.6.1
[28] modelr_0.1.2 pkgconfig_2.0.2 htmltools_0.3.6
[31] tidyselect_0.2.4 workflowr_1.1.1 codetools_0.2-15
[34] doMC_1.3.5 fansi_0.3.0 viridisLite_0.3.0
[37] crayon_1.3.4 withr_2.1.2 R.methodsS3_1.7.1
[40] grid_3.6.1 nlme_3.1-137 gtable_0.2.0
[43] git2r_0.23.0 magrittr_1.5 scales_1.0.0
[46] cli_1.0.0 stringi_1.2.4 sp_1.3-1
[49] xml2_1.2.0 iterators_1.0.10 tools_3.6.1
[52] glue_1.3.0 RNetCDF_1.9-1 maps_3.3.0
[55] hms_0.4.2 ncdump_0.0.3 parallel_3.6.1
[58] yaml_2.2.0 colorspace_1.3-2 rvest_0.3.2
[61] plotly_4.9.0 knitr_1.20 bindr_0.1.1
[64] haven_1.1.2
This reproducible R Markdown analysis was created with workflowr 1.1.1