Overview

The purpose of this vignette is to walk through how to calculate MHWs in both the R and Python languages, using code from both languages throughout. Its secondary purpose is to show that the outputs are identical, and its tertiary purpose is to benchmark the code so that the speed of the two languages may be compared.

Calculating events

Let’s look at the two different ways in which one may calculate MHWs: first the R code, then the Python.

R code

Setup

The basic functionality for calculating marine heatwaves (MHWs) in R may be found in the heatwaveR package, which may currently be downloaded and installed with the following lines of code. Note that if one has already installed these packages they do not need to be installed again. To run any of the lines in the following code chunk, first remove the comment character (#) at the start of the line.

# install.packages("devtools")
## Development version from GitHub
# devtools::install_github("robwschlegel/heatwaveR") 
## Stable version from rOpenSci
## Currently not available
## Stable version from CRAN
# install.packages("heatwaveR") # Currently not available

With the necessary packages installed, we activate heatwaveR with the following line of code.

library(heatwaveR)

Default output

With everything downloaded and ready for us to use we may now calculate some events. The heatwaveR package has three built-in time series (sst_WA, sst_Med, sst_NW_Atl) that we may use to more easily demonstrate how the code works. The general pipeline in R for calculating events is first to create a climatology from a chosen time series using ts2clm(), and then to feed that output into detect_event(), as seen below.

# First we create a default climatology as outlined in Hobday et al. (2016)
ts_clim <- ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))

# Then we feed that object into the second and final function
res <- detect_event(ts_clim)

To look at these outputs we would use the following options. For now we’ll just look at the event output.

# Look at the top six rows of the first 6 columns of the events
res$event[1:6, 1:6]
## # A tibble: 6 x 6
##   event_no index_start index_peak index_end duration date_start
##      <int>       <dbl>      <int>     <dbl>    <int> <date>    
## 1        1         884        887       895       12 1984-06-02
## 2        2         899        901       904        6 1984-06-17
## 3        3         908        922       926       19 1984-06-26
## 4        4        1008       1010      1012        5 1984-10-04
## 5        5        1023       1027      1029        7 1984-10-19
## 6        6        1033       1034      1052       20 1984-10-29
# Or perhaps the most intense event
res$event[res$event$intensity_max == max(res$event$intensity_max), 1:6]
## # A tibble: 1 x 6
##   event_no index_start index_peak index_end duration date_start
##      <int>       <dbl>      <int>     <dbl>    <int> <date>    
## 1       42       10629      10651     10689       61 2011-02-06
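
The climatology output mentioned above may be inspected in the same way. A minimal sketch (the exact columns may differ slightly between package versions):

# Look at the first few rows of the daily climatology output
head(res$climatology)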

Python code

Setup

To download and install the Python package for calculating MHWs one may run the following line of code in a console:

# Note that the comment character (#) must be removed in order for the line to run
# pip install git+git://github.com/ecjoliver/marineHeatWaves@master

Or simply download the GitHub repository and follow the installation instructions there. Or if still lost, phone a friend!

Before we begin the calculations we need to create a time series. I’ve chosen to take the sst_WA time series from the heatwaveR package to ensure consistency between the results. For ease of use I simply saved this object as a csv file on my computer and loaded it back into Python.

ts <- sst_WA
# Write out only a vector of temperatures
write.csv(ts$temp, "data/sst_WA.csv", row.names = FALSE)

Default calculations

With the package installed it is then activated and run as follows:

# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
sst = np.loadtxt(open("data/sst_WA.csv", "r"), delimiter=',', skiprows=1) # because a heading row needs to be skipped
# The event metrics
mhws, clim = mhw.detect(t, sst)
# Save event results
mhws_df = pd.DataFrame.from_dict(mhws)
mhws_df.to_csv('data/mhws_py.csv', sep = ',', index=False)
# Save climatology results
clim_df = pd.DataFrame.from_dict(clim)
clim_df.to_csv('data/clim_py.csv', sep = ',', index=False)

Reticulate code

It is also possible to run the Python code through R with the use of the R package reticulate. This is particularly useful as it allows us to perform the comparisons and benchmarking all within the same language.

Setup

Here we load the reticulate package and choose the conda environment I’ve already created called py27. For help on how to set up a conda environment go here. I’ve ensured that all of the required python modules are installed within this environment.

# install.packages("reticulate")
library(reticulate)
use_condaenv("py27")

Once we’ve told R which version/environment we would like to use for Python we may then load the necessary modules.

np <- import("numpy")
datetime <- import("datetime")
mhw <- import("marineHeatWaves")

One may run Python code in its native form within R by passing it as a character vector to py_run_string().

py_run_string("x = 10")
py_run_string("t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)")
py_run_string("sst = np.loadtxt(open('data/sst_WA.csv', 'r'), delimiter=',', skiprows=1)")

Default calculations

# These numbers were taken from print(t) in the python code above
t <- as.integer(np$array(seq(723546, 735598)))
sst <- np$array(sst_WA$temp)
res_python <- mhw$detect(t = py$t, temp = py$sst) # There appears to be an Rcpp issue preventing this from handshaking...
## Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: integer argument expected, got float
## 
## Detailed traceback: 
##   File "/home/rws/anaconda2/envs/py27/lib/python2.7/site-packages/marineHeatWaves.py", line 188, in detect
##     year[i] = date.fromordinal(t[i]).year
# The reticulate package is still in development...
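
The error suggests that the ordinal dates are being converted to floats somewhere in the handshake between R and Python. One possible workaround, shown here only as an unverified sketch, is to keep the entire call inside the Python session via py_run_string() so that the arguments are never converted, and only pull the results back into R afterwards.

# A sketch of a possible workaround (not verified here):
# run the detection entirely within the Python session...
py_run_string("import marineHeatWaves as mhw")
py_run_string("mhws, clim = mhw.detect(t, sst)")
# ...and only then pull the results back into R as lists
res_python <- py$mhws
clim_python <- py$clim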

Comparisons

With some climatologies and events calculated with both languages we may now compare the results of the two. I’ll do so here natively in R to avoid any potential hiccups from translating across languages. I’ll create the R output here, and load the Python output created above.

# Load libraries for data manipulation and visualisation
library(tidyverse)

# Set R results
res_event_R <- res$event
res_clim_R <- res$climatology

# Set Python results
res_event_Python <- read_csv("data/mhws_py.csv")
res_clim_Python <- read_csv("data/clim_py.csv")

With these default results loaded in the same format in the same language we can now start to look at how they stack up. For starters I am just doing some simple sums and correlations.

Climatologies

cor(res_clim_R$seas, res_clim_Python$seas)
## [1] 0.9999999
sum(res_clim_R$seas) - sum(res_clim_Python$seas)
## [1] -2.039741
cor(res_clim_R$thresh, res_clim_Python$thresh)
## [1] 0.9999996
sum(res_clim_R$thresh) - sum(res_clim_Python$thresh)
## [1] -6.563502

The seasonal and threshold climatologies correlate very well, but adding them up we see that the Python values are consistently higher. This is due to rounding differences between the languages. Now for the events.
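
To put the size of these differences in perspective, one may look at the per-day differences directly. A minimal sketch (the exact values will depend on the data and package versions used):

# The summed differences above are spread over roughly 12,000 daily values,
# so the mean per-day difference is tiny
mean(res_clim_R$seas - res_clim_Python$seas)
mean(res_clim_R$thresh - res_clim_Python$thresh)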

Events

First we peek at a few specific values of interest.

cor(res_event_R$intensity_max, res_event_Python$intensity_max)
## [1] 0.9999997
sum(res_event_R$intensity_max) - sum(res_event_Python$intensity_max)
## [1] 0.008638202
cor(res_event_R$intensity_var, res_event_Python$intensity_var)
## [1] 0.999608
sum(res_event_R$intensity_var) - sum(res_event_Python$intensity_var)
## [1] 0.7344747
cor(res_event_R$rate_onset, res_event_Python$rate_onset)
## [1] 1
sum(res_event_R$rate_onset) - sum(res_event_Python$rate_onset)
## [1] -0.0001461964
cor(res_event_R$rate_decline, res_event_Python$rate_decline)
## [1] 0.9999999
sum(res_event_R$rate_decline) - sum(res_event_Python$rate_decline)
## [1] 0.0002725762

With that looking fine, we next run a for loop that calculates a correlation for every column that has the same name in both outputs.

# Remove non-numeric columns
res_event_num <- res_event_R %>% 
  select_if(is.numeric)

# Run the loop
res_event <- data.frame()
for(i in 1:length(colnames(res_event_num))){
  if(colnames(res_event_num)[i] %in% colnames(res_event_Python)){
    x1 <- res_event_R[colnames(res_event_R) == colnames(res_event_num)[i]]
    x2 <- res_event_Python[colnames(res_event_Python) == colnames(res_event_num)[i]]
    x <- data.frame(r = cor(x1, x2, use = "complete.obs"),
                    difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4),
                    var = colnames(res_event_num)[i])
    colnames(x)[1] <- "r"
    rownames(x) <- NULL
    } else {
      x <- data.frame(r = NA, difference = NA, var = colnames(res_event_num)[i])
      }
  res_event <- rbind(res_event, x)
  }
res_event
##            r difference                            var
## 1         NA         NA                       event_no
## 2  1.0000000    60.0000                    index_start
## 3  1.0000000    60.0000                     index_peak
## 4  1.0000000    60.0000                      index_end
## 5  1.0000000     0.0000                       duration
## 6  0.9999991     0.0086                 intensity_mean
## 7  0.9999997     0.0086                  intensity_max
## 8  0.9996080     0.7345                  intensity_var
## 9  1.0000000     0.1616           intensity_cumulative
## 10 0.9999971     0.0158       intensity_mean_relThresh
## 11 0.9999994     0.0166        intensity_max_relThresh
## 12 0.9995685     0.7374        intensity_var_relThresh
## 13 0.9999996     0.4741 intensity_cumulative_relThresh
## 14 1.0000000     0.0000             intensity_mean_abs
## 15 0.9990733     1.6100              intensity_max_abs
## 16 0.9997267     0.7667              intensity_var_abs
## 17 1.0000000     0.0000       intensity_cumulative_abs
## 18 1.0000000    -0.0001                     rate_onset
## 19 0.9999999     0.0003                   rate_decline

With all of our overlapping columns compared, and our differences and Pearson r values looking solid, let’s finish off this basic comparison by finding which columns are not shared between the two language outputs. Also note that the apparently large differences for the index_start, index_peak, and index_end values are due to the different indexing conventions of R and Python: R starts counting at 1 and Python at 0, so each of the 60 events is offset by exactly one, which sums to the difference of 60 seen above.

cols_R <- colnames(res_event_R)[!(colnames(res_event_R) %in% colnames(res_event_Python))]
cols_R
## [1] "event_no"
cols_Py <- colnames(res_event_Python)[!(colnames(res_event_Python) %in% colnames(res_event_R))]
cols_Py
## [1] "category"          "duration_extreme"  "duration_moderate"
## [4] "duration_severe"   "duration_strong"   "n_events"         
## [7] "time_end"          "time_peak"         "time_start"

Wonderful! Most things match up nicely. The duration of the different categories of events is something added in R by a separate function, category(), and will be left that way for now. The “time” columns in the Python output aren’t relevant in the present usage, as far as I can see, because these functions currently only take day values. The different methods of labelling the events will be left as they are for now as well.
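
For the curious, here is a quick sketch of how the event categories may be added on the R side (the exact columns returned may differ between heatwaveR versions):

# Add the category information to the events detected above
res_category <- category(res, name = "WA")
head(res_category)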

As noted above, the index_start, index_peak, and index_end values are off by one between the two languages due to this difference in indexing. Looking at the date_start, date_peak, and date_end values we see that they are the same.
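
This may be confirmed with a quick check. A minimal sketch, assuming the events line up one-to-one in the two outputs and that read_csv() has parsed the Python dates as dates:

# The indices should differ by exactly one for every event...
all(res_event_R$index_start - res_event_Python$index_start == 1)
# ...while the corresponding dates should be identical
all(res_event_R$date_start == as.Date(res_event_Python$date_start))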

Benchmarks

The final thing we want to look at in this vignette is the speed differences in calculating MHWs between the two languages.

R

library(microbenchmark)
# The old R code
library(RmarineHeatWaves)
microbenchmark(detect(make_whole(data = sst_WA), climatology_start = "1982-01-01", climatology_end = "2014-12-31"), times = 10)
## Unit: milliseconds
##                                                                                                      expr
##  detect(make_whole(data = sst_WA), climatology_start = "1982-01-01",      climatology_end = "2014-12-31")
##       min       lq     mean   median       uq      max neval
##  677.5837 681.5922 873.8009 699.8798 883.0094 1947.545    10
# The new R code
microbenchmark(detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"), robust = FALSE)), times = 10)
## Unit: milliseconds
##                                                                                                         expr
##  detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01",      "2014-12-31"), robust = FALSE))
##       min       lq     mean  median      uq      max neval
##  167.6448 174.1651 241.5099 182.385 334.037 355.6131    10

Python

import time
total = 0
for i in range(10):
    start = time.clock()
    mhws, clim = mhw.detect(t, sst)
    end = time.clock()
    total += end-start
bench_py = total/10
print(bench_py)
## 0.1806082

The Python code appears to be somewhat faster, at roughly 0.18 seconds compared to a mean of roughly 0.24 seconds for the new R code shown above. The Python code is also calculating the event categories, which the R code is not; that is done in an additional step with category(). The R code does however allow for a wider range of options for calculating the climatologies.