Overview

The purpose of this vignette is to quantify the effects that missing data have on the detection of events. Specifically, we quantify the relationship between the percentage of missing data, as well as the number of consecutive missing days, and how much the seasonal climatology, the 90th percentile threshold, and the MHW metrics may differ from those detected in the same time series with no missing data.

The missing data will be ‘created’ by striking out existing data from the three pre-packaged time series in the heatwaveR package, which themselves have no missing data. This will be done in two ways:

  • the first method will be to simply use a random selection algorithm to indiscriminately remove an increasing proportion of the data;
  • the second method will be to systematically remove data from specific days/times of the year to simulate missing data that users may encounter in real-world scenarios:
    • one approach will be to remove only weekend days from the data;
    • another will be to remove only winter data to simulate ice cover

It is hypothesised that the more consecutive missing days of data there are, the larger the effect will be on the results. It is also hypothesised that consecutive missing days will have a more pronounced effect, and so be a better indicator of time series usability, than the simple proportion of missing data. A third hypothesis is that there will be a relatively clear threshold for the number of consecutive missing days beyond which accurate MHW detection is prevented.

library(tidyverse)
library(broom)
library(heatwaveR)
library(lubridate)
library(fasttime)
library(ggpubr)
library(boot)
library(FNN)
library(mgcv)
library(padr)
# A clump of functions used below
# Written out here for tidiness/convenience


# Function for knocking out data while maintaining time series consistency
random_knockout <- function(df, prop){
  res <- df %>% 
    sample_frac(size = prop) %>% 
    arrange(t) %>%
    # Pad out the full date range so that knocked-out days appear as
    # explicit NA rows rather than being silently dropped
    pad(interval = "day", start_val = min(df$t), end_val = max(df$t)) %>%
    mutate(miss = as.character(1 - prop))
  return(res)
}

# Quantify consecutive missing days
con_miss <- function(df){
  ex1 <- rle(is.na(df$temp))                       # run lengths of missing/present data
  ind1 <- rep(seq_along(ex1$lengths), ex1$lengths) # run ID for each row
  s1 <- split(1:nrow(df), ind1)                    # row indices per run
  res <- do.call(rbind,                            # keep only the NA runs
                 lapply(s1[ex1$values == TRUE], function(x)
                   data.frame(index_start = min(x), index_end = max(x))
                   )) %>% 
    mutate(duration = index_end - index_start + 1) %>% 
    group_by(duration) %>% 
    summarise(count = n())
  return(res)
}

# Function that knocks out pre-set amounts of data
# This then allows for easier replication
# It expects the data passed to it to already be nested
knockout_repl <- function(df) {
  res <- df %>% 
    mutate(prop_10 = map(data, random_knockout, prop = 0.9),
           prop_25 = map(data, random_knockout, prop = 0.75),
           prop_50 = map(data, random_knockout, prop = 0.5))
  return(res)
}

# Calculate only events
detect_event_event <- function(df, y = temp){
  ts_y <- eval(substitute(y), df)
  df$temp <- ts_y
  res <- detect_event(df)$event
  return(res)
  }

# Run an ANOVA on each metric of the combined event results and get the p-value
event_aov_p <- function(df){
  aov_models <- df[ , -grep("miss", names(df))] %>%
    map(~ aov(.x ~ df$miss)) %>% 
    map_dfr(~ broom::tidy(.), .id = 'metric') %>%
    mutate(p.value = round(p.value, 5)) %>%
    filter(term != "Residuals") %>%
    select(metric, p.value)
  return(aov_models)
  }

# Run an ANOVA on each metric of the combined event results and get CI
event_aov_CI <- function(df){
  # Run ANOVAs
  aov_models <- df[ , -grep("miss", names(df))] %>%
    map(~ aov(.x ~ df$miss)) %>% 
    map_dfr(~ confint_tidy(.), .id = 'metric') %>% 
    mutate(miss = as.factor(rep(c("0", "0.1", "0.25", "0.5"), nrow(.)/4))) %>% 
    select(metric, miss, everything())
  # Calculate population means
  df_mean <- df %>% 
    group_by(miss) %>%
    summarise_all(.funs = mean) %>%
    gather(key = metric, value = conf.mid, -miss)
  # Correct CI for first category
  res <- aov_models %>%
    left_join(df_mean, by = c("metric", "miss")) %>%
    mutate(conf.low = if_else(miss == "0", conf.low - conf.mid, conf.low),
           conf.high = if_else(miss == "0", conf.high - conf.mid, conf.high)) %>%
    select(-conf.mid)
  return(res)
  }

# A particular summary output
event_output <- function(df){
  res <- df %>%
    group_by(miss) %>% 
    # select(-event_no) %>% 
    summarise_all(c("mean", "sd"))
  return(res)
  }

Random missing data

Knocking out data

First up, we begin with the random removal of increasing proportions of the data. We are going to use the full 33-year time series for these missing data experiments. For now we will randomly remove 0%, 10%, 25%, and 50% of the data from each of the three time series. This is repeated 100 times to provide more robust results. If this approach looks promising, it will be repeated at every 1% step from 0% to 50% missing data to produce a more precise estimate of where the breakdown in accurate detection begins to occur.
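The code that assembles these time series is not shown here, so below is a hypothetical sketch of how sst_ALL_miss could be built with the helpers defined earlier. The sst_ALL and knockout_one_rep names are assumptions; sst_WA, sst_NW_Atl, and sst_Med ship with heatwaveR.

# Bind the three pre-packaged time series together (names are assumptions)
sst_ALL <- bind_rows(
  mutate(sst_WA, site = "WA"),
  mutate(sst_NW_Atl, site = "NW_Atl"),
  mutate(sst_Med, site = "Med")
)

# One replicate: nest by site, knock out 10/25/50% of the data,
# and stack the results alongside the complete ("0") time series
knockout_one_rep <- function(r) {
  sst_nest <- sst_ALL %>% 
    group_by(site) %>% 
    nest() %>% 
    knockout_repl()
  bind_rows(
    sst_nest %>% select(site, data) %>% unnest() %>% mutate(miss = "0"),
    sst_nest %>% select(site, prop_10) %>% unnest(),
    sst_nest %>% select(site, prop_25) %>% unnest(),
    sst_nest %>% select(site, prop_50) %>% unnest()
  ) %>% 
    mutate(rep = as.character(r))
}

# Repeat the knockout 100 times
sst_ALL_miss <- map_dfr(1:100, knockout_one_rep)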

Consecutive missing days

With some data randomly removed, we now look at how many consecutive days of missing data this creates.

# Calculate the consecutive missing days
sst_ALL_con <- sst_ALL_miss %>% 
  filter(miss != "0") %>% 
  group_by(site, miss, rep) %>% 
  nest() %>%
  mutate(data2 = map(data, con_miss)) %>% 
  select(-data) %>%
  unnest()
save(sst_ALL_con, file = "data/sst_ALL_con.Rdata")
load("data/sst_ALL_con.Rdata")

It is reassuring to see that for the 10% and 25% missing data time series there are few instances of more than two consecutive days of missing data. This is good because the default MHW detection algorithm setting is to interpolate over days with two or fewer missing values. It is therefore unlikely that one or two consecutive days of missing data will affect the resultant climatologies or MHWs.
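For example (assuming the current ts2clm() interface, in which the maxPadLength argument controls this interpolation; the object name and value here are illustrative), short gaps can be filled like so:

# Interpolate over gaps of two or fewer missing days when building the climatology
clim_pad <- ts2clm(sst_WA, climatologyPeriod = c("1982-01-01", "2014-12-31"),
                   maxPadLength = 2)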

Seasonal signal

# Calculate the thresholds for all of the created time series
sst_ALL_clim <- sst_ALL_miss %>% 
  group_by(site, miss, rep) %>% 
  nest() %>% 
  mutate(clim = map(data, ts2clm, climatologyPeriod = c("1982-01-01", "2014-12-31"),
                    var = TRUE)) %>% # var = TRUE so the variance may be compared below
  select(-data) %>% 
  unnest()
save(sst_ALL_clim, file = "data/sst_ALL_clim_miss.Rdata")

As we may see in the figure above, there is no perceptible difference in the seasonal signals created from time series with varying amounts of missing data for the Mediterranean (Med) and Northwest Atlantic (NW_Atl) time series. The signal does, however, appear to vary both above and below the real signal at the 50% missing data mark for the Western Australia (WA) time series.

90th percentile threshold

In the figure above we are able to see that the 90th percentile threshold is lower for the time series missing 50% of their data, but that the 10% and 25% thresholds do not appear different from the control. This is most pronounced for the Western Australia (WA) time series, and will likely have a significant impact on the size of the MHWs detected.

Climatology statistics

ANOVA p-values

Before we calculate the MHW metrics, let’s look at the output of an ANOVA comparing the seasonal and threshold values against one another across the different proportions of missing data for each time series.
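As a hypothetical sketch (assuming sst_ALL_clim as created above, with the seas, thresh, and var columns coming from the ts2clm() output), these comparisons could be run with the event_aov_p() helper defined earlier:

# ANOVA p-values for each climatology statistic, per site
sst_ALL_clim_aov_p <- sst_ALL_clim %>% 
  select(site, miss, seas, thresh, var) %>% 
  group_by(site) %>% 
  nest() %>% 
  mutate(res = map(data, event_aov_p)) %>% 
  select(-data) %>% 
  unnest()

# visualise
ggplot(sst_ALL_clim_aov_p, aes(x = site, y = metric)) +
  geom_tile(aes(fill = p.value)) +
  geom_text(aes(label = round(p.value, 2)))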

If one were to compare this heatmap to those generated by reducing time series length, one would see that the effect of shorter time series on discrepancies between the seasonal signals is much greater than that of missing data. The difference in the 90th percentile thresholds due to missing data is, however, significant, and is a more pronounced effect than that created by shorter time series.

(ECJO)[If you think on the problem this is what we expect, on average. If we randomly remove data from a fixed sample then we are just as likely to remove data above as below the mean, so the sample mean will not change. However, as we remove data we will reduce the size of the tails (on average), i.e. they will contract towards the median of the distribution. This will lead to an under-estimate of the 90th percentile, and an over-estimate of the 10th percentile. I never thought about this before. But it means that as more data are missing in a time series, the threshold for what is considered a MHW will be reduced. Maybe we can come up with a recommended correction to the threshold given a proportion of missing data?]

(RWS)[This is definitely something I would like to quantify. It would be ideal if this ratio could be provided in the best practices bullet points.]

Confidence intervals

Whereas p-values do have their use, I find that confidence intervals allow for a better judgement of the relationship between categories of data. In the figure above we see which missing proportions of data set the time series furthest apart. For the seasonal signals (seas) we see that an increase in missing data does increase the uncertainty in the signal, but it doesn't cause the temperatures in the signal to become warmer or colder. With the 90th percentile threshold (thresh), however, the time series missing 50% of their data have temperature thresholds significantly lower than time series with no missing data. The other missing data proportions also show generally lower thresholds, but not significantly so. When we look at the variance (var) panel we see why this may be. At 10% missing data the variance of the measured data does not diverge significantly from the complete data; at 25% it already has, and at 50% the difference is enormous. This appears to show that because the seasonal signal is a smooth composite of the available data, the variance in the measurements does not have a large effect on it. The 90th percentile threshold is a different story. For the same reason that bootstrapping did not work for the selection of random years of data in the shorter time series experiments, large amounts of missing data likewise prevent the accurate creation of the 90th percentile thresholds in these experiments. The reason for this is that many of the large (outlier-like, but real) values are not included in the calculation of the 90th percentile, which by necessity requires large values to be calculated accurately.
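A hypothetical sketch of how these confidence intervals could be generated with the event_aov_CI() helper defined above (again assuming the sst_ALL_clim object):

# CIs per climatology statistic and missing data proportion, per site
sst_ALL_clim_CI <- sst_ALL_clim %>% 
  select(site, miss, seas, thresh, var) %>% 
  group_by(site) %>% 
  nest() %>% 
  mutate(res = map(data, event_aov_CI)) %>% 
  select(-data) %>% 
  unnest()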

Kolmogorov-Smirnov

The figure above helps to break down the differences between the climatologies created through randomly missing data. This helps to explain why the seasonal signals do not differ significantly, but why the thresholds and variance distributions do. As we saw in the line graphs above, it is the Western Australia (WA) time series that seems to most quickly deviate from the real signal when missing data are introduced.
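As a minimal sketch, one such Kolmogorov-Smirnov comparison could be run with the base R ks.test(); the site, proportions, and replicate chosen here are illustrative:

# Compare the distribution of the 90th percentile threshold between the
# complete and 50% missing WA time series for a single replicate
thresh_full <- sst_ALL_clim %>% 
  filter(site == "WA", miss == "0", rep == "1") %>% 
  pull(thresh)
thresh_miss <- sst_ALL_clim %>% 
  filter(site == "WA", miss == "0.5", rep == "1") %>% 
  pull(thresh)
ks.test(thresh_full, thresh_miss)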

Perhaps a lack of autocorrelation in a time series, or the intra-annual variance in the seasonal signal, may be a good estimator for the effect that missing data will have on the accuracy of the calculated climatologies.

(ECJO)[??? why ???] (RWS)[I must provide a fuller explanation for my thinking here. A visualisation of the relationship would be ideal.]

MHW metrics

Now that we know that missing data from a time series does produce significantly different 90th percentile thresholds, we want to see what the downstream effects are on the MHW metrics.

# Calculate events
sst_ALL_event <- sst_ALL_clim %>% 
  group_by(site, miss, rep) %>% 
  nest() %>% 
  mutate(res = map(data, detect_event_event)) %>% 
  select(-data) %>% 
  unnest()
save(sst_ALL_event, file = "data/sst_ALL_event_miss.Rdata")
load("data/sst_ALL_event_miss.Rdata")

ANOVA

As we may see in the figure above, all of the metrics differ significantly when various amounts of missing data are introduced into the time series. To see where these differences materialise we will visualise the confidence intervals.
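A hypothetical sketch of these ANOVAs, re-using the event_aov_p() helper on the event metrics (the column names are standard detect_event() output):

# ANOVA p-values for each MHW metric, per site
sst_ALL_event_aov_p <- sst_ALL_event %>% 
  select(site, miss, duration, intensity_mean, intensity_max, intensity_cumulative) %>% 
  group_by(site) %>% 
  nest() %>% 
  mutate(res = map(data, event_aov_p)) %>% 
  select(-data) %>% 
  unnest()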

Confidence intervals

As with the climatology values, the MHW metrics from time series missing 50% of their data are significantly different from the control (0% missing), as well as from the 10% and 25% missing data groups. There is generally a decrease in the magnitude of the metrics reported as one increases the amount of missing data, with the exception of the Northwest Atlantic (NW_Atl) time series, which tends to have larger values reported at 10% missing data. This is important to note, as it means that one may not say with certainty that the more data are missing from a time series, the less intense the MHWs detected will be, though this is generally the case.

Perhaps the increase in max intensity at 10% missing data for the Northwest Atlantic time series is due to many cold-spells pulling the 90th percentile threshold down. Though it would be odd that this would only emerge at 10% missing data. Remember, however, that these confidence intervals have been generated from 100 replicates, so it is unlikely that this is just a chance effect.

(ECJO)[I would think that as the proportion of missing data is increased, the climatological mean stays the same (so anomalies are unchanged) but the threshold is reduced. Therefore, events will start and end at smaller temperature anomalies thus bringing down the mean intensity but raising the duration. You see this in the change from 0 to 10% missing, but it breaks down for >10% missing. This may be because as more data is missing the events are destroyed since the gaps are too large.]

(RWS)[That sounds about right to me. It’s just strange that this really only comes through for the NW Atlantic.]

Consecutive missing days

(RWS: I think it would also be worthwhile to run a linear regression on the number of consecutive missing days in a time series against how much (ANOVA; p-value) the detected events differ from the control group.)

Categories

(RWS: I’m not currently looking at the effect of missing data on MHW categories. But I can do if there is interest in it.)

Non-random missing data

There are two realistic instances of consistent missing data in SST time series. The first is for polar data, when satellites cannot collect data because the sea surface is covered in ice. Whereas this would obviously preclude any MHWs from occurring during that period, the question here is how long stretches of missing data affect the creation of the climatologies, and therefore the detection of MHWs. The other sort of consecutive missing data is found in those in situ time series that are collected by hand by (likely) a government employee who does not work on weekends. In the following two sub-sections we will knock out data intentionally to simulate these two issues. Lastly, because we have seen above that consecutive missing days of data are problematic, in addition to looking at the effects of missing weekends (2 consecutive days), we will also look at the effect of 3 and 4 consecutive missing days. I assume that a recurring four consecutive missing days will render the time series void, but three may be able to squeak by.

Ice coverage

To simulate ice cover we will remove the three winter months from each time series. Because two of the three time series are from the northern hemisphere, the times removed will not be the same for all time series.
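A hypothetical sketch of this knockout, assuming the sst_ALL object from earlier; the month ranges match the comparisons made later in this vignette:

# Set the three winter months to NA for each site (Jul-Sep for WA in the
# southern hemisphere; Jan-Mar for the northern hemisphere sites)
sst_ALL_ice <- sst_ALL %>% 
  mutate(month = month(t),
         temp = case_when(
           site == "WA" & month %in% 7:9 ~ NA_real_,
           site != "WA" & month %in% 1:3 ~ NA_real_,
           TRUE ~ temp
         )) %>% 
  select(-month)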

In the figure above we see that removing the winter months for the Mediterranean and Northwest Atlantic time series produces relatively even missing gaps near the bottom of each seasonal cycle. The Western Australia time series appears to have a much less regular seasonal signal and so the missing data periods vary somewhat more.

Climatology statistics

Before we begin comparing the climatology statistics for ‘ice coverage’ vs. complete time series, let’s see what the seasonal climatologies look like alongside each other.
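The creation of the 'ice coverage' climatologies is not shown, so here is a hypothetical sketch mirroring the approach used for the randomly missing data above:

# Calculate the thresholds for the 'ice coverage' time series
sst_ALL_ice_clim <- sst_ALL_ice %>% 
  group_by(site) %>% 
  nest() %>% 
  mutate(clim = map(data, ts2clm, climatologyPeriod = c("1982-01-01", "2014-12-31"))) %>% 
  select(-data) %>% 
  unnest()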

We may see in the figure above that the calculated seasonal climatologies from the ‘ice coverage’ time series appear to be nearly identical to those from complete time series.

t-test p-values

For the following tests we are only going to compare data between the complete and ice coverage time series for days in which data exist. Doing otherwise would unfairly bias the results.
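As a minimal sketch (object names are assumptions), one such comparison for a single site could look like this:

# t-test on the threshold values for the Med site, complete vs. 'ice
# coverage', using only the days on which data exist
ice_med <- sst_ALL_ice_clim %>% 
  filter(site == "Med", !is.na(temp))
comp_med <- sst_ALL_clim %>% 
  filter(site == "Med", miss == "0", rep == "1") %>% 
  semi_join(ice_med, by = "t")
t.test(ice_med$thresh, comp_med$thresh)$p.value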

From the heatmap above we see that there is practically no difference in the climatologies calculated when there is ice coverage.

MHW metrics

I’m a little bit surprised at the results from the statistical climatology comparisons, but they make sense if one thinks about it. Because the algorithm is not attempting to interpolate or otherwise fill any gaps, it does not actually change the data used for the climatology calculations. The result is that for the doys (days of year) on which climatologies are calculated, the results are the same as though there were no ice coverage at all.

The impact this may have on the MHWs (for the periods of the year where data are present) is something I am very much interested in seeing. I’m really not sure what the effect may be…

t-test

# Calculate events
suppressWarnings(
sst_ALL_ice_event <- sst_ALL_ice_clim %>% 
  group_by(site) %>% 
  nest() %>% 
  mutate(res = map(data, detect_event_event)) %>% 
  select(-data) %>% 
  unnest()
) # Must suppress warnings caused by big gaps or the machine crashes

# Function for calculating t-tests on specific event metrics
event_ttest_p_ice <- function(df){
    df_comp <- sst_ALL_event %>% 
      filter(site == df$site[1], miss == "0", rep == "1") %>% 
      mutate(month = month(date_peak, label = T))
    if (df_comp$site[1] == "WA"){
      df_comp <- df_comp %>% 
        filter(!(month %in% c("Jul", "Aug", "Sep")))
    } else {
      df_comp <- df_comp %>% 
        filter(!(month %in% c("Jan", "Feb", "Mar")))
    }
    res <- data.frame(site = df$site[1],
                      duration = t.test(df$duration, df_comp$duration)$p.value,
                      intensity_mean = t.test(df$intensity_mean, df_comp$intensity_mean)$p.value,
                      intensity_max = t.test(df$intensity_max, df_comp$intensity_max)$p.value,
                      intensity_cumulative = t.test(df$intensity_cumulative, 
                                                    df_comp$intensity_cumulative)$p.value)
  return(res)
}

# The t-test results
sst_ALL_ice_event_ttest_p <- sst_ALL_ice_event %>%
  mutate(site_i = site) %>% 
  nest(-site_i) %>%
  mutate(res = map(data, event_ttest_p_ice)) %>% 
  select(-data) %>% 
  unnest() %>% 
  select(site:intensity_cumulative) %>% 
  gather(key = stat, value = p.value, -site)

# visualise
ggplot(sst_ALL_ice_event_ttest_p, aes(x = site, y = stat)) +
  geom_tile(aes(fill = p.value)) +
  geom_text(aes(label = round(p.value, 2))) +
  scale_fill_gradient2(midpoint = 0.1, limits = c(0, 1)) +
  theme(axis.text.y = element_text(angle = 90, hjust = 0.5))

The heatmap above shows how similar (p-values from t-tests) the MHW metrics are for events detected in complete vs. ice coverage time series. In order to allow the comparison to be more accurate, I removed MHWs from the complete time series whose peak intensity dates occurred during the ice coverage months. Were one to leave these events in, the p-values would decrease, with the max and mean intensities for events in the Mediterranean becoming significantly different from one another.

I don’t think it is correct to compare the full events though, which is why I’ve rather chosen to remove events that occurred during ice coverage before comparing the results.

I think these results show rather conclusively that long blocks of contiguous missing data in a time series (e.g. ice coverage) do not have an appreciable effect on the climatologies or the resultant events detected. Of course, ice coverage will never be such a clean break. Rather, the random day or two of available data will encroach around the edges of the missing data, causing the MHW algorithm to need to interpolate across these smaller windows. That is an issue we will tackle in the section on missing n consecutive days, should we choose to go into that much depth.

Missing weekends

(ECJO)[I’m not sure that this is telling us much different than would a 100*(2/7)% random removal test using the method from above…]

(RWS)[I’ve included this as it is a real-life test case for in-situ collected data coming from a governmental/private organisation.]
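The weekend knockout itself is not shown, so here is a hypothetical sketch, assuming the sst_ALL object from earlier:

# Remove weekend days; lubridate::wday() returns 1 for Sunday and
# 7 for Saturday by default
sst_ALL_weekend <- sst_ALL %>% 
  mutate(temp = ifelse(wday(t) %in% c(1, 7), NA_real_, temp))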

Climatology statistics

I’ll not visualise the calculated climatologies here against the control as I don’t expect them to appear much different. Rather I’ll jump straight into the statistical comparisons.

t-test p-values

Because we are only comparing the distributions of two sets of sample data (no missing data vs. missing weekends), we will rather use a t-test to make the comparisons.

As with most of the results so far, the variances between the control and the missing-weekend time series are significantly different, though in this case for only two of the three time series. The seasonal signals and 90th percentile thresholds do not differ appreciably.

As I’ve run a t-test here and not an ANOVA, I am not going to manually generate CIs. Also, there are only two time series being compared here, so CIs aren’t as meaningful.

Kolmogorov-Smirnov

Consistently throughout we see that any minor changes to the time series have appreciable effects on the variance detected in the seasonal signal, but often not on the other climatology values.

MHW metrics

t-test

The figure above shows the results of t-tests for each metric between the complete time series and when the weekend temperature data have been removed. There is no appreciable impact on the detected duration/intensities.

Missing n-consecutive days

(RWS: I’m rather leaving this section for now in favour of looking at the relationship between the number of total consecutive missing days and the effect that has on the results.)

(ECJO)[I think that’s good. I think you can probably get this info from the first batch of analyses you did on random data removal.]

(RWS: If it is not shown in detail above, the relationship between n consecutive missing days and the results will be quantified here.)

Best practices

(RWS: I envision this last section distilling everything learned above into a nice bullet list. These bulleted items for each different vignette would then all be added up in one central best practices vignette that would be ported over into the final write-up in the discussion/best-practices section of the paper. Ideally these best-practices could also be incorporated into the R/python distributions of the MHW detection code to allow users to make use of the findings from the paper in their own work.)

Options for addressing large amounts of randomly missing data

Options for addressing large amounts of non-randomly missing data