r_vs_python.Rmd
The purpose of this vignette is to walk one through how to calculate MHWs in both the R and Python languages. It will use code from both languages throughout. The secondary purpose of this vignette is to show that the outputs are identical. The tertiary purpose of this vignette is to run benchmarks on the code to compare their speeds.
First up, let’s look at the two different ways in which one may calculate MHWs. First we will look at the R code, then Python.
The basic functionality for calculating marine heatwaves (MHWs) in R may be found in the heatwaveR
package that may currently be downloaded and installed with the following lines of code. Note that if one has already installed these packages they do not need to be installed again. To run one of the lines in the following code chunk will require that the hash-tag first be removed.
## This is an R chunk ##
# install.packages("devtools")
## Development version from GitHub
# devtools::install_github("robwschlegel/heatwaveR")
## Stable version from CRAN
# install.packages("heatwaveR")
With the necessary packages installed, we activate heatwaveR
with the following line of code.
With everything downloaded and ready for us to use we may now calculate some events. The heatwaveR
package has three built in time series (sst_WA
, sst_Med
, sst_NW_Atl
) that we may use to more easily demonstrate how the code works. The general pipeline in R
for calculating events is first to create a climatology from a chosen time series using ts2clm()
, and then to feed that output into detect_event()
, as seen below.
## This is an R chunk ##
# First we create a default climatology as outlined in Hobday et al. (2016)
# NB: We use a longer climatology period here as this matches the Python default
ts_clim <- ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))
# Then we feed that object into the second and final function
res_r <- detect_event(ts_clim)
To look at these outputs we would use the following options. For now we’ll just look at the event output.
## This is an R chunk ##
# Look at the top six rows of the first 6 columns of the events
res_r$event[1:6, 1:6]
## # A tibble: 6 x 6
## event_no index_start index_peak index_end duration date_start
## <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 1 884 887 895 12 1984-06-02
## 2 2 899 901 904 6 1984-06-17
## 3 3 908 922 926 19 1984-06-26
## 4 4 1008 1010 1012 5 1984-10-04
## 5 5 1023 1027 1029 7 1984-10-19
## 6 6 1033 1034 1052 20 1984-10-29
# Or perhaps the most intense event
res_r$event[res_r$event$intensity_max == max(res_r$event$intensity_max), 1:6]
## # A tibble: 1 x 6
## event_no index_start index_peak index_end duration date_start
## <dbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 42 10629 10651 10689 61 2011-02-06
To download and install the Python package for calculating MHWs one may run the following line of code in a console:
## This is a bash chunk ##
# Note that the hashtag must be removed in orde for the code to run
# pip install git+git://github.com/ecjoliver/marineHeatWaves@master
Or simply download the GitHub repository and follow the instructions there for downloading. Or if still lost, phone a friend!
Before we begin the calculations we need to create a time series. I’ve chosen to take the sst_WA
time series from the heatwaveR
package to ensure consistency between the results. The python code prefers to have the dates and the temperatures fed into it as two separate vectors, so I will first split these up in R before we feed them to Python.
With the package installed, and our time series prepped, the basic MHW calculation is performed as follows:
## This is a Python chunk ##
# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
# We could take the date values we created above,
# But let's rather create the date values in Python
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
# Note that to fetch an object created in R we use 'r.*'
# Where '*' is the name of the R object
sst = np.asarray(r.sst)
# Or if one rather wants to load the data from a csv file
# sst = np.loadtxt("data/sst_WA.csv", delimiter=',', skiprows=1) # because a heading row needs to be skipped
# The event metrics
mhws, clim = mhw.detect(t, sst)
# If one wants to save the results as csv files
# Save event results
mhws_df = pd.DataFrame.from_dict(mhws)
mhws_df.to_csv('data/mhws_py.csv', sep = ',', index=False)
#
# Save climatology results
clim_df = pd.DataFrame.from_dict(clim)
clim_df.to_csv('data/clim_py.csv', sep = ',', index=False)
It is also possible to run the Python code through R with the use of the R package reticulate
. This is particularly useful as it allows us to perform the comparisons and benchmarking all within the same language.
Here we load the reticulate
package and choose the conda environment I’ve already created called py27
. For help on how to set up a conda environment go here. I’ve ensured that all of the required python modules are installed within this environment.
Once we’ve told R which version/environment we would like to use for Python we may then load the necessary modules.
## This is an R chunk ##
np <- import("numpy")
datetime <- import("datetime")
mhw <- import("marineHeatWaves")
One may run python code in it’s native form within R by passing it as a character vector to py_run_string()
.
## This is an R chunk ##
py_run_string("t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)")
py_run_string("sst = np.loadtxt(open('data/sst_WA.csv', 'r'), delimiter=',', skiprows=1)")
# py_run_string("sst.flags.writeable = True")
# py_run_string("sst.setflags(write = True)")
# py_run_string("sst = np.array(sst).copy()")
Or we may just create the objects in native R.
Either way, once we have created the necessary parts of the time series in either language, we may call the Python functions in R directly with the line of code below.
## This is an R chunk ##
# The following line appears not to run because the numpy array that Python expects is given to it by R in a read-only format...
# Thus far I've not found a way to correct for this.
# So it appears that one may pass R objects to Python code directly, but not to reticulate code...
res_python <- mhw$detect(t = t, temp = sst)
With some climatologies and events calculated with both languages we may now compare the results of the two. I’ll do so here natively in R to avoid any potential hiccups from translating across languages. I’ll create the R output here, and load the Python output created in the Python code chunk above.
## This is an R chunk ##
# Load libraries for data manipulation and visualisation
library(tidyverse)
# Set R results
res_event_R <- res_r$event
res_clim_R <- res_r$climatology
# Set Python results
res_event_Python <- read_csv("data/mhws_py.csv")
res_clim_Python <- read_csv("data/clim_py.csv")
With these default results loaded in the same format in the same language we can now start to look at how they compare. For starters I am just doing some simple sums and correlations.
## [1] 0.9999999
## [1] -2.053821
## [1] 0.9999996
## [1] -6.573385
The seasonal and threshold climatologies correlate very well, but adding them up we see that the Python values are consistently higher. This is due to rounding differences between the languages. Even though the sum difference between the thresholds may look large at a value of ~6.5, considering this is over 12053 days of data, the difference is only ~0.0005/day. So I think it is safe to move on and look at the events.
It was also recently brought to my attention by Zijie Zhao, author of the MATLAB distribution for MHW code, that Python and R calculate February 29th differently. With Python calculating the climatologies first, and then filling in the missing days, whereas R calculates February 29th first, and then the climatologies from that. So with that being said let’s have a peek at the difference this may cause.
## This is an R chunk ##
res_clim_R_29 <- res_clim_R %>%
filter(doy == 60) %>%
select(seas, thresh) %>%
distinct()
res_clim_Python_29 <- res_clim_Python %>%
filter(res_clim_R$doy == 60) %>%
select(seas, thresh) %>%
distinct()
res_clim_R_29$seas - res_clim_Python_29$seas
## [1] 0.0003321473
## [1] -0.005438159
Another thing pointed out by Zijie is that Python and MATLAB calculate percentiles differently, this then is likely also true for R and Python. This would explain why the seasonal climatologies (seas
) are always much closer than the thresholds (thresh
) for all of these tests.
Below we have a for loop that will run a correlation on all rows with the same name, as well as find the sum difference between them.
## This is an R chunk ##
# Remove non-numeric columns
res_event_num <- res_event_R %>%
select_if(is.numeric)
# Run the loop
res_event <- data.frame()
for(i in 1:length(colnames(res_event_num))){
if(colnames(res_event_num)[i] %in% colnames(res_event_Python)){
x1 <- res_event_R[colnames(res_event_R) == colnames(res_event_num)[i]]
x2 <- res_event_Python[colnames(res_event_Python) == colnames(res_event_num)[i]]
x <- data.frame(r = cor(x1, x2, use = "complete.obs"),
difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4),
var = colnames(res_event_num)[i])
colnames(x)[1] <- "r"
rownames(x) <- NULL
} else {
x <- data.frame(r = NA, difference = NA, var = colnames(res_event_num)[i])
}
res_event <- rbind(res_event, x)
}
na.omit(res_event)
## r difference var
## 2 1.0000000 60.0000 index_start
## 3 1.0000000 60.0000 index_peak
## 4 1.0000000 60.0000 index_end
## 5 1.0000000 0.0000 duration
## 6 0.9999991 0.0089 intensity_mean
## 7 0.9999997 0.0087 intensity_max
## 8 0.9996073 0.7347 intensity_var
## 9 1.0000000 0.1630 intensity_cumulative
## 10 0.9999971 0.0160 intensity_mean_relThresh
## 11 0.9999994 0.0165 intensity_max_relThresh
## 12 0.9995693 0.7376 intensity_var_relThresh
## 13 0.9999996 0.4755 intensity_cumulative_relThresh
## 14 1.0000000 0.0000 intensity_mean_abs
## 15 0.9990733 1.6100 intensity_max_abs
## 16 0.9997263 0.7667 intensity_var_abs
## 17 1.0000000 0.0004 intensity_cumulative_abs
## 18 0.9999999 -0.0003 rate_onset
## 19 0.9999999 0.0002 rate_decline
It is a bit odd that there is such a large difference between intensity_var
for the two language. This implies that R is detecting larger differences in the intensity around the threshold than Python is. This may again be due to the differences in thresholds being detected, but that makes a rather small difference. It is perhaps more likely that variance is calculated slightly differently between languages. Looking at the source code one sees that this is calculated by first finding the variance of the intensity, and then the square root of that. These two calculations likely differ enough between the languages to be causing this. Ultimately it is of little concern as the variance values found throughout both packages are not of much interest to anyone.
We may also see that the difference in intensity_max_abs
is a bit higher in Python than in R. This is rather surprising as intensity_max_abs
simply shows the maximum temperature (independent of any thresholds etc.) experienced during that event. And considering that these calculations were performed on the exact same time series, intensity_max_abs
should always be exactly the same between the two languages. Looking at the climatology output one sees that the temperature is still correct, meaning that the difference must occur somewhere within the R function detect_event
. Looking at the R source code we see that intensity_max_abs
is calculated on line 298, for Python this is calculated on line 376. The reason for the difference is because R is looking for the maximum temperature throughout all dates for a given event, whereas Python is simply taking the temperature on the peak date of the event. This is a subtle but significant difference as this means this value is showing two different things. I’m not certain which is more appropriate, though I’m leaning towards the Python implementation.
With all of our overlapping columns compared, and our differences and Pearson r values looking solid, let’s finish off this basic comparison by finding which columns are not shared between the different language outputs. Also, please note that the apparently large differences between the index_start
, index_peak
, and index_end
values are due to the different indexing methods of R and Python. R starts at 1 and Python at 0. The
## This is an R chunk ##
cols_R <- colnames(res_event_R)[!(colnames(res_event_R) %in% colnames(res_event_Python))]
cols_R
## [1] "event_no"
cols_Py <- colnames(res_event_Python)[!(colnames(res_event_Python) %in% colnames(res_event_R))]
cols_Py
## [1] "category" "duration_extreme" "duration_moderate"
## [4] "duration_severe" "duration_strong" "n_events"
## [7] "time_end" "time_peak" "time_start"
Wonderful! Almost everything matches up exactly. The duration of categories of events is something added in R by another function category()
, and will be left that way for now. The “time” columns in the Python output aren’t relevant as far as I can see in the present usage as these functions currently only take day values. The different methods of labelling the events will be left as they are for now as well.
It is also worth noting that the values for index_start
, index_peak
, and index_end
are off by one between the two languages. This is due to the difference in indexing between the languages. Looking at the date_start
, date_peak
, and date_end
values we see that they are the same.
A bug was discovered in v0.3.3 of the R code with regards to the handling of missing data. Specifically, in the move to a C++ back-end, the R code stopped being able to handle missing data for the climatology calculations. This has since been corrected from v0.3.4 onwards, but it is worthwhile to ensure that missing data are handled the same way.
First we’ll create the dataframe with missing data:
## This is an R chunk ##
library(padr)
library(lubridate)
random_knockout <- function(df, prop){
res <- df %>%
sample_frac(size = prop) %>%
arrange(t) %>%
pad(interval = "day")
return(res)
}
# Remove 10% of the data randomly
sst_WA_miss_random <- random_knockout(sst_WA, 0.9)
write.csv(sst_WA_miss_random$temp, file = "data/sst_WA_miss_random.csv", row.names = F, na = "NaN")
# Remove simulated ice cover
sst_WA_miss_ice <- sst_WA %>%
mutate(month = month(t, label = T),
temp = ifelse(month %in% c("Jan", "Feb", "Mar"), NA, temp)) %>%
select(-month)
write.csv(sst_WA_miss_ice$temp, file = "data/sst_WA_miss_ice.csv", row.names = F, na = "NaN")
With the missing data saved, we can now calculate and compare the climatologies that the two different languages will create.
## This is a Python chunk ##
# Required for data prep and export
import numpy as np
from datetime import date
import pandas as pd
# The MHW functions
import marineHeatWaves as mhw
# The date values
t = np.arange(date(1982,1,1).toordinal(),date(2014,12,31).toordinal()+1)
# The temperature values
sst_random = np.loadtxt("data/sst_WA_miss_random.csv", delimiter = ',', skiprows = 1)
sst_ice = np.loadtxt("data/sst_WA_miss_ice.csv", delimiter = ',', skiprows = 1)
# The event metrics
mhws_random, clim_random = mhw.detect(t, sst_random)
# It appears as though the Python code can't deal with ice coverage...
#mhws_ice, clim_ice = mhw.detect(t, sst_ice)
# Save climatology results
clim_random_df = pd.DataFrame.from_dict(clim_random)
clim_random_df.to_csv('data/clim_py_random.csv', sep = ',', index = False)
mhws_random_df = pd.DataFrame.from_dict(mhws_random)
mhws_random_df.to_csv('data/mhws_py_random.csv', sep = ',', index = False)
#clim_ice_df = pd.DataFrame.from_dict(clim_ice)
#clim_ice_df.to_csv('data/clim_ice_py.csv', sep = ',', index = False)
With the Python climatologies and events calculated from the missing data we’ll now do the same for R.
## This is an R chunk ##
# Load and prep missing data
sst_WA_miss_random <- read_csv("data/sst_WA_miss_random.csv") %>%
dplyr::rename(temp = x) %>%
mutate(t = seq(as.Date("1982-01-01"), as.Date("2014-12-31"), by = "day")) %>%
select(t, temp)
sst_WA_miss_random$temp[is.nan(sst_WA_miss_random$temp)] <- NA
# caluclate R climatologies/events
res_random_R <- detect_event(ts2clm(sst_WA_miss_random, climatologyPeriod = c("1982-01-01", "2014-12-31")))
res_clim_random_R <- res_random_R$climatology
res_event_random_R <- res_random_R$event
# Load Python results
res_clim_random_Python <- read_csv("data/clim_py_random.csv")
res_event_random_Python <- read_csv("data/mhws_py_random.csv")
And then the comparisons of the seas
and thresh
values.
## This is an R chunk ##
# Compare clims/thresholds
cor(res_clim_random_R$seas, res_clim_random_Python$seas)
## [1] 0.9999996
## [1] -2.372016
## [1] 0.9999343
## [1] -866.0401
We may see from the results above that the values still correlate very strongly, if slightly less so than with no missing data. We see again that the Python values are higher overall for both the seasonal climatology and the threshold. This time however the differences are much larger. The seasonal climatology is off by ~0.0004/day, which is acceptable, but the 90th percentile threshold is off by ~0.0695. This is too large of a difference and this issue will need to be investigated. It is most likely due to the difference in how quantiles are calculated between the languages though and so there may be little to do about it. Perhaps more importantly is how these differences may affect the events being detected.
## [1] 57
## [1] 53
## This is an R chunk ##
# Remove non-numeric columns
res_event_random_num <- res_event_random_R %>%
select_if(is.numeric)
# Run the loop
res_event_random <- data.frame()
for(i in 1:length(colnames(res_event_random_num))){
if(colnames(res_event_random_num)[i] %in% colnames(res_event_random_Python)){
x1 <- res_event_random_R[colnames(res_event_random_R) == colnames(res_event_random_num)[i]]
x2 <- res_event_random_Python[colnames(res_event_random_Python) == colnames(res_event_random_num)[i]]
x <- data.frame(difference = round(sum(x1, na.rm = T) - sum(x2, na.rm = T), 4),
var = colnames(res_event_random_num)[i])
# colnames(x)[1] <- "r"
rownames(x) <- NULL
} else {
x <- data.frame(difference = NA, var = colnames(res_event_random_num)[i])
}
res_event_random <- rbind(res_event_random, x)
}
na.omit(res_event_random)
## difference var
## 2 19349.0000 index_start
## 3 19365.0000 index_peak
## 4 19420.0000 index_end
## 5 75.0000 duration
## 6 3.7080 intensity_mean
## 7 6.6284 intensity_max
## 8 2.8726 intensity_var
## 9 93.4348 intensity_cumulative
## 10 2.4525 intensity_mean_relThresh
## 11 5.4013 intensity_max_relThresh
## 12 2.8771 intensity_var_relThresh
## 13 57.0796 intensity_cumulative_relThresh
## 14 91.7193 intensity_mean_abs
## 15 95.5900 intensity_max_abs
## 16 3.0015 intensity_var_abs
## 17 1686.3577 intensity_cumulative_abs
## 18 3.4802 rate_onset
## 19 -0.2195 rate_decline
Because the count of events detected is not the same, we cannot run correlations. We can sum up the differences to get an idea of how far off things are, and we can see that the results are troubling. The differences in the index_*
values can be discounted as we have already established that we have a different number of events. The next largest difference is that for intensity_cumulative_abs
. Looking at the one event that is the same between the languages one may see that this value is the same for both, meaning that we may discount this difference as it is not due to calculation differences, but rather also due to the differing number of events. Regardless, it is an indicator that overall the much lower threshold in the R language is giving us results that look much more dire than the Python language. This is problematic and stems from the quantile calculation issue.
I looked into this in quite some detail and it appears to stem not only from the calculation of quantiles being different, but also from the R quantile calculations never giving more than one additional decimal place of precision over the initial data, whereas the Python results float near infinity. This explains why the R results always tend to be lower, but does not account for why the presence of missing data has such a dramatic effect.
## This is an R chunk ##
clim_only_random_R <- res_clim_random_R %>%
select(doy, seas, thresh) %>%
distinct() %>%
filter(doy != 60)
clim_only_random_Python <- res_clim_random_Python %>%
slice(1:365)
cor(clim_only_random_R$seas, clim_only_random_Python$seas)
## [1] 0.9999996
## [1] -0.07149032
## [1] 0.9999996
## [1] -26.22067
Looking at only the values for one seasonal climatology or threshold cycle we see that the differences are the same as when we compare all of the data. I don’t know why they wouldn’t be… but I wanted to check.
After having gone over the R code very thoroughly I think the next course of action is to go through the step by step output of the Python code to see where exactly these changes begin to occur. It could be in the following places:
runavg
is used for the same purposeThe final thing we want to look at in this vignette is the speed differences in calculating MHWs between the two languages.
library(microbenchmark)
# The old R code
library(RmarineHeatWaves)
microbenchmark(detect(make_whole(data = sst_WA), climatology_start = "1982-01-01", climatology_end = "2014-12-31"), times = 10)
## Unit: seconds
## expr
## detect(make_whole(data = sst_WA), climatology_start = "1982-01-01", climatology_end = "2014-12-31")
## min lq mean median uq max neval
## 1.040593 1.048529 1.211674 1.081987 1.190646 2.022516 10
# The new R code
microbenchmark(detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"))), times = 10)
## Unit: milliseconds
## expr
## detect_event(ts2clm(data = sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31")))
## min lq mean median uq max neval
## 192.4564 208.8704 279.6261 287.8735 319.2132 383.9136 10
Unsurprisingly, the average for running the heatwave analysis on heatwaveR
10 times comes out at ~0.194 seconds per calculations, which is faster than the average of ~0.730 with RmarineHeatWaves
.
import time
total = 0
for i in range(10):
start = time.clock()
mhws, clim = mhw.detect(t, sst)
end = time.clock()
total += end-start
bench_py = total/10
print(bench_py)
## 0.173738
The average speed for running the Python code 10 times comes out at ~0.175 seconds, which is slightly faster than R. And the Python code is also calculating the event categories, which R is not. That is done in an additional step with category()
. The R code is however allowing for a wider range of options for climatologies, and may be more easily multi-cored, as shown in this vignette.
Overall the basic applications of these two languages provide near identical results, and nearly the same computational time. I would not say that one is better enough than the other to warrant switching between languages. Rather this vignette supports the argument that people collaborating on a research project while using R and Python do not need to worry about their results. The exception currently being that large amounts of missing data appear to be influencing the calculation of the 90th percentile threshold differently. This is something that will be investigated further and corrected as necessary.