  • Introduction
  • Dataset
  • Inferential Goals
  • Model
    • Simulated data
    • Real data
  • Diagnostics and Convergence
    • 1. Overall Convergence of the Empirical Distribution
    • 2. Stationarity
    • 3. Speed of Exploration of Target Support
    • 4. Correlation
    • 5. Proximity to i.i.d. simulation
    • 6. Convergence of each individual empirical mean
    • ggmcmc
  • Inferential Findings
    • Point and Interval Estimation
    • Hypothesis testing
  • Model comparisons
  • Model Checking and Posterior Predictive Distribution
    • In-Sample
    • Out-of-Sample

Introduction

Google Trends data provide real-time insight into what people are searching for online, and they have already been used effectively in public health: for example, tracking flu-related searches has helped health organizations anticipate outbreaks.

Predicting Google searches for the “Vehicles and Cars” category could offer valuable insights into consumer behavior, market trends, and broader economic conditions. Cars are durable goods: significant investments that consumers typically make when they are confident in their economic stability, and whose demand is quite elastic in the short run. High interest in purchasing vehicles can indicate that consumers feel financially secure, reflecting broader economic health. On the other hand, reduced interest in purchasing high-value items like cars can be an early indicator of an economic slowdown or recession.

Dataset

We start by loading the file containing the weekly search interest for vehicles over the last 5 years (from 2019/07/07 to 2024/06/30). This data was downloaded directly from the Google Trends website, selecting the United States as country and “Cars and vehicles” as category.

data <- read.csv("veicoli.csv", header = TRUE)
data$Week <- as.Date(data$Week, format = "%Y-%m-%d")
data_ts <- ts(data$Interest, frequency = 52)  # weekly ts object used below (assumed construction: 52 weeks per year)
dim(data)
## [1] 262   2
head(data)
##         Week Interest
## 1 2019-07-07       79
## 2 2019-07-14       80
## 3 2019-07-21       83
## 4 2019-07-28       82
## 5 2019-08-04       81
## 6 2019-08-11       80

Plotting the series suggests using a model that takes past values into account to predict future ones, i.e. an autoregressive (AR) model. We now need to establish whether our time series is stationary, as AR models assume stationarity, meaning that the statistical properties of the series do not change over time. In particular:

  • the mean of the series should not be a function of time, which means that the mean should be roughly constant though some variance can be modeled;

  • the variance of the series should not be a function of time (homoskedasticity);

  • the autocovariance of the terms should not be a function of time but only of the lag.

We will conduct the Augmented Dickey-Fuller test (ADF), a type of statistical test called a unit root test, which is conducted with the following assumptions:

  • The null hypothesis of the test is that the series is non-stationary, i.e. it has a unit root.

  • The alternative hypothesis is that the series is stationary, i.e. it has no unit root. If the null hypothesis cannot be rejected, the test provides evidence that the series may be non-stationary.

library(tseries)  # provides adf.test
print(adf.test(data_ts))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_ts
## Dickey-Fuller = -3.4596, Lag order = 6, p-value = 0.04711
## alternative hypothesis: stationary

The ADF p-value (0.047) is only marginally below the conventional 0.05 level, so the evidence for stationarity is weak, and we will transform the series before passing it to our autoregressive model.

We use the decompose method to determine what needs to be removed from the time series to make it stationary.

decomp <- decompose(data_ts)
plot(decomp)

This plot shows that our time series has a downward trend as well as a stable seasonal component. Moreover, the time series shows heteroskedasticity, while autoregressive models assume homoskedastic data. To address these issues we can:

  • take a Box-Cox transformation, a family of power transformations aimed at stabilizing variance and making the data more closely approximate a normal distribution

  • remove the seasonal component

  • remove the trend

library(zoo)  # na.approx fills the NAs at the ends of the estimated trend
data_ts <- data_ts - decomp$seasonal - na.approx(decomp$trend, rule = 2)

lambda <-forecast::BoxCox.lambda(data_ts)
## Warning in guerrero(x, lower, upper): Guerrero's method for selecting a Box-Cox
## parameter (lambda) is given for strictly positive data.
y <- forecast::BoxCox(data_ts, lambda)

print(adf.test(y))
## Warning in adf.test(y): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  y
## Dickey-Fuller = -5.046, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
plot(y, main = "Transformed data", type = 'l')

There is now much less evidence for non-stationarity, and we can reject the null hypothesis of the ADF test.

Now we need to check whether our de-trended time series contains autocorrelation. We can see the degree to which a time series is correlated with its past values by computing the autocorrelation function (ACF). We will also check the partial autocorrelation function (PACF), which controls for the correlation at all intermediate lags and is helpful in determining the order of the AR model.

par(mfrow = c(1, 2))

acf(y)
pacf(y)

The transformed series still shows strong positive autocorrelation across many lags, while the values of the PACF plot drop off quickly after the second lag.

Inferential Goals

Before talking about the model, we want to specify the inferential goals of our analysis.

  1. Parameter Estimation: estimate the posterior distributions of the model parameters and summarize these distributions using posterior means, medians, and credible intervals.

  2. Hypothesis Testing: test specific hypotheses about the parameters by examining their posterior distributions.

  3. Model Checking and Validation: simulate data from the posterior predictive distribution and compare it with the observed data to check if the model can reproduce the key characteristics of the observed data.

  4. Forecasting: generate future values of the time series based on the posterior distribution of the parameters. Use the posterior samples to simulate future values and obtain a distribution of forecasted values at each future time point.

  5. Model Comparison: compare different models using the Deviance Information Criterion (DIC).

Model

An autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term. An AR(p) process can be written as

$$Y_t = \alpha + \sum_{i=1}^{p} \phi_i Y_{t-i} + \epsilon_t$$

where

  • α is a constant (the overall mean parameter). It represents the baseline level of the time series when all lagged terms are zero.

  • ϕi are the model autoregression parameters. These coefficients represent the relationship between the current value of the time series and its previous values. They indicate the strength and direction of the influence of past values on the current value. For example, a positive ϕ1 means that if the previous value was high, the current value tends to be high as well.

  • ϵt represents the random noise or shocks in the time series. It is usually assumed to be white noise, meaning it has a mean of zero, constant variance and no autocorrelation.

In a Bayesian framework, we need to specify the likelihood function and the prior distributions for these parameters since they are treated as random variables.

For the AR(p) model the likelihood is given by:

$$Y_t \mid \alpha, \phi_1, \dots, \phi_p, \sigma^2 \sim \mathcal{N}\left(\alpha + \sum_{i=1}^{p} \phi_i Y_{t-i},\ \sigma^2\right)$$

The prior distributions can be chosen based on domain knowledge or common practice. One reasonable choice is:

  • $\alpha \sim \mathcal{N}(0, 10)$

  • $\phi_i \sim \mathcal{N}(0, \tfrac{1}{4})$

  • $\sigma \sim \mathrm{Unif}(0, 10)$

since we don’t want the $\phi_i$ to take values that are too large (large autoregressive coefficients push the process toward non-stationarity), and the standard deviation must always be positive.

As usual in Bayesian analysis, we will update these priors to posteriors using the likelihood function. To obtain the posterior distributions we will use MCMC, which allows us to sample from the posterior when it is difficult to compute directly.
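
The JAGS model files used below (model_sim.txt, model_ar_v1.txt) are not reproduced in this report. As a rough sketch, a generic AR(p) model file consistent with the priors above could look like the following (a hypothetical reconstruction, not the actual file; note that JAGS parametrizes dnorm with a precision, i.e. 1/variance):

# Sketch of a generic AR(p) JAGS model file (hypothetical reconstruction)
model_string <- "
model {
  for (t in (p+1):T) {
    for (i in 1:p) {
      lag_term[t, i] <- phi[i] * Y[t - i]   # contribution of lag i
    }
    Y[t] ~ dnorm(alpha + sum(lag_term[t, 1:p]), tau)
  }
  alpha ~ dnorm(0, 0.1)    # N(0, 10): variance 10 -> precision 0.1
  for (i in 1:p) {
    phi[i] ~ dnorm(0, 4)   # N(0, 1/4): variance 1/4 -> precision 4
  }
  sigma ~ dunif(0, 10)     # uniform prior on the standard deviation
  tau <- 1 / pow(sigma, 2) # JAGS precision
}
"
writeLines(model_string, "model_ar_v1.txt")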

Simulated data

Before creating a model for our data, let’s simulate two AR(1) processes.

set.seed(123)
T = 200
t_seq = 1:T
sigma = 1
alpha = 0.5

phi = 0.6
y_sim = rep(NA,T)
y_sim[1] = rnorm(1,0,sigma)
for(t in 2:T) y_sim[t] <- rnorm(1, alpha + phi * y_sim[t-1], sigma)

phi = 1.1
y_sim_2 = rep(NA,T)
y_sim_2[1] = rnorm(1,0,sigma)
for(t in 2:T) y_sim_2[t] <- rnorm(1, alpha + phi * y_sim_2[t-1], sigma)

It is interesting to notice that if |ϕ|>1 the process diverges very quickly (it is not stationary). This gives us useful information for the prior distributions of the parameters. More generally, for an AR(p) model to be stationary, the roots of the characteristic polynomial $1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p$ must all lie outside the unit circle in the complex plane.
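
We can verify this numerically with base R's polyroot, which returns the complex roots of a polynomial given its coefficients in increasing order of degree. For the two AR(1) processes simulated above:

# Roots of the AR characteristic polynomial 1 - phi_1*z - ... - phi_p*z^p
ar_roots <- function(phi) polyroot(c(1, -phi))
Mod(ar_roots(0.6))   # ~1.67 > 1: the first simulated process is stationary
Mod(ar_roots(1.1))   # ~0.91 < 1: the second one is explosive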

Now we will try to recover the model parameters from the first simulated data.

sim_parameters =  c("alpha","phi","sigma")

sim_data <- list(Y = y_sim, T = length(y_sim), p= 1)

# Run the model
sim_model <- jags(data = sim_data,
                     parameters.to.save = sim_parameters,
                     model.file="model_sim.txt",
                     n.chains=2,
                     n.iter=1000,
                     n.burnin=200,
                     quiet = TRUE
                     )

print(sim_model)
## Inference for Bugs model at "model_sim.txt", fit using jags,
##  2 chains, each with 1000 iterations (first 200 discarded)
##  n.sims = 1600 iterations saved
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## alpha      0.595   0.103   0.393   0.526   0.593   0.661   0.797 1.001  1600
## phi        0.519   0.061   0.398   0.478   0.519   0.558   0.642 1.004  1600
## sigma      0.951   0.049   0.859   0.916   0.949   0.985   1.048 1.001  1600
## deviance 542.777   2.558 539.875 540.969 542.132 543.903 549.351 1.009   710
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 3.3 and DIC = 546.0
## DIC is an estimate of expected predictive error (lower deviance is better).

Indeed, the posterior means we obtained are close to the true parameter values, and in particular the true values fall inside the 95% credible intervals.

Next, we simulate an AR(2) process and check again the ability of MCMC to recover the parameters from the data.

set.seed(123)

# True parameters
phi = c(0.4, -0.2)
sigma = 0.5
alpha = 0.4

y_sim_3 = rep(NA,T)
y_sim_3[1:2] = rnorm(2, 0, sigma)
for(t in 3:T) y_sim_3[t] <- rnorm(1, alpha + phi[1] * y_sim_3[t-1] + phi[2] * y_sim_3[t-2] , sigma)

plot(y_sim_3, type = 'l', main = "Simulated AR(2) Data")

p <- 2
T <- length(y_sim_3)

sim_data <- list(Y = y_sim_3, T = T, p = p)

params <- c("alpha", "phi","sigma")

inits <- list(
  list(alpha = 1, phi =  c(1, 1), sigma = 0.1), 
  list(alpha = 0.5, phi = c(0.5, 0.5),  sigma = 0.5),
  list(alpha = 0, phi = c(1, 0.5),  sigma = 1)
)

jags_sim <- jags(data=sim_data,
               inits=inits,
               model.file="model_ar_v1.txt",
               parameters.to.save = params,
               n.chains=3,
               n.iter=10000,
               n.burnin=2000,
               quiet = TRUE)

print(jags_sim)
## Inference for Bugs model at "model_ar_v1.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 2000 discarded), n.thin = 8
##  n.sims = 3000 iterations saved
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## alpha      0.448   0.054   0.348   0.411   0.449   0.484   0.558 1.002  1900
## phi[1]     0.308   0.071   0.167   0.262   0.308   0.354   0.448 1.001  3000
## phi[2]    -0.211   0.071  -0.350  -0.259  -0.210  -0.164  -0.072 1.002  1700
## sigma      0.478   0.025   0.432   0.460   0.476   0.495   0.529 1.001  3000
## deviance 267.205   2.949 263.574 264.974 266.511 268.583 274.338 1.001  3000
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 4.3 and DIC = 271.6
## DIC is an estimate of expected predictive error (lower deviance is better).

Even in this case the true parameters are inside the credible intervals and the estimated parameters are very close to the true ones.

Real data

Now, we can fit the model to our data. We will use an AR(2) model, since the PACF cuts off at 2.

Important input parameters to JAGS are:

  • the number of Markov chains to produce (n.chains): multiple chains help detect non-ergodic behavior, which can occur if a chain gets stuck and does not explore all areas of the parameter space

  • the burn-in iterations (n.burnin), used to discard the first samples, which might be unreliable

l <- 4
y_tr <- y[c(1:(length(y) - l))]
y_te <- y[c((length(y) - l + 1):length(y))]

p <- 2
T <- length(y_tr)

real_data <- list(Y = y_tr, T = T, p = p)

params <- c("alpha", "phi","sigma")

inits <- list(
  list(alpha = 1, phi =  c(1, 1), sigma = 0.1), 
  list(alpha = 0.5, phi = c(0.5, 0.5),  sigma = 0.5),
  list(alpha = 0, phi = c(1, 0.5),  sigma = 1)
)

jags.1 <- jags(data=real_data,
               inits=inits,
               model.file="model_ar_v1.txt",
               parameters.to.save = params,
               n.chains=3,
               n.iter=10000,
               n.burnin=2000,
               quiet = TRUE)

print(jags.1)
## Inference for Bugs model at "model_ar_v1.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 2000 discarded), n.thin = 8
##  n.sims = 3000 iterations saved
##           mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat
## alpha      -0.193   0.170   -0.531   -0.305   -0.192   -0.078    0.142 1.001
## phi[1]      0.578   0.061    0.456    0.537    0.578    0.618    0.701 1.003
## phi[2]      0.184   0.062    0.059    0.143    0.184    0.225    0.301 1.001
## sigma       2.701   0.120    2.472    2.617    2.696    2.781    2.945 1.001
## deviance 1233.912   2.843 1230.474 1231.778 1233.236 1235.315 1241.218 1.002
##          n.eff
## alpha     3000
## phi[1]     960
## phi[2]    3000
## sigma     3000
## deviance  1100
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 4.0 and DIC = 1237.9
## DIC is an estimate of expected predictive error (lower deviance is better).

Diagnostics and Convergence

First of all, we check whether the MCMC chains have converged.

1. Overall Convergence of the Empirical Distribution

To understand this we can start by looking at the Gelman-Rubin diagnostic (R-hat), which compares the variance between multiple chains to the variance within each chain. R-hat values close to one generally indicate that the chains have converged; a common rule of thumb requires R-hat < 1.1, and in our case all values are well below this threshold. We can also look at the evolution of the Gelman-Rubin shrink factor as the number of iterations increases, to be more confident about convergence.
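
A sketch of how these diagnostics can be obtained with the coda package (assuming the fitted object jags.1 from above; the same as.mcmc conversion is used later for ggmcmc):

library(coda)
mcmc_chains <- as.mcmc(jags.1)   # convert the R2jags fit to a coda mcmc.list
gelman.diag(mcmc_chains)         # R-hat point estimates and upper confidence limits
gelman.plot(mcmc_chains)         # evolution of the shrink factor with iterations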

2. Stationarity

Next, we look at the traceplots: plots of iteration number (x) vs sampled value (y). If the chains show no trend and are well mixed, this suggests convergence. Indeed, a traceplot with no trend suggests that the chain has reached a stable state where the samples are being drawn from the stationary distribution of the parameter (stationarity implies that the distribution of the chain has converged to the target distribution). Good mixing indicates that the chain is not getting stuck and is exploring the parameter space effectively.
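
A sketch of how these traceplots can be produced, using R2jags' built-in helper:

traceplot(jags.1)   # one traceplot per monitored parameter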

No patterns or irregularities are observed, so we can assume convergence.

3. Speed of Exploration of Target Support

We then monitor the autocorrelation of the chains, i.e. the correlation of each sampled series with its own lags. Rapid decay suggests good mixing and an efficient exploration of the target distribution.
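
With coda this can be checked as follows (a sketch, reusing the mcmc_chains object from above):

autocorr.plot(mcmc_chains)   # per-chain autocorrelation as a function of lag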

In our case the autocorrelation falls to around zero already after the first lag.

4. Correlation

Next we look at correlation between parameters. High correlation might suggest dependencies that affect the estimate of the parameters.
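
Again a small coda sketch:

crosscorr(mcmc_chains)        # matrix of cross-correlations between parameters
crosscorr.plot(mcmc_chains)   # graphical version of the same matrix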

There is some correlation between the autoregressive parameters. This is expected, since the lagged terms are not independent of each other, and it shouldn’t be an issue, but it is worth noting.

5. Proximity to i.i.d. simulation

We can look at the Effective Sample Size (ESS) to evaluate sampling efficiency. A higher ESS means the MCMC samples carry more independent information, suggesting that the chain is exploring the parameter space in more depth and thus that the estimates are more reliable.
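
Values like the ones below can be obtained with coda's effectiveSize (a sketch; the exact call used in the report is not shown):

effectiveSize(as.mcmc(jags.1))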

##    alpha deviance   phi[1]   phi[2]    sigma 
## 3096.074 2908.872 3492.404 3000.000 3000.000

Our ESS is close to the actual number of saved draws (n.sims = 3000), which further suggests that our chains have converged.

6. Convergence of each individual empirical mean

We can then examine the Monte Carlo Standard Error (MCSE), an estimate of the inaccuracy of the Monte Carlo samples with respect to the expectation of the posterior distribution. It is essentially a standard deviation around the posterior mean: it measures the variation of the mean of the parameter of interest due to the simulation. If the MCSE is low compared to the corresponding estimated standard deviation, the posterior mean was estimated with high precision. This is our case.
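
A sketch of one way to build the table below, approximating the MCSE as the posterior SD divided by the square root of the effective sample size (the report's exact code is not shown):

sims    <- jags.1$BUGSoutput$sims.matrix
pars    <- c("alpha", "phi[1]", "phi[2]", "sigma")
sd_post <- apply(sims[, pars], 2, sd)
mcse    <- sd_post / sqrt(effectiveSize(as.mcmc(jags.1))[pars])
data.frame(Parameter = pars, MCSE = mcse, SD = sd_post)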

##   Parameter        MCSE         SD
## 1     alpha 0.003127058 0.17041526
## 2    phi[1] 0.001100422 0.06133645
## 3    phi[2] 0.001119163 0.06163908
## 4     sigma 0.002275473 0.12011944

ggmcmc

Finally, using ggmcmc we can look at a collection of model diagnostics in a single PDF file.

S <- ggs(as.mcmc(jags.1))
ggmcmc(S)
## Plotting histograms
## Plotting density plots
## Plotting traceplots
## Plotting running means
## Plotting comparison of partial and full chain
## Plotting autocorrelation plots
## Plotting crosscorrelation plot
## Plotting Potential Scale Reduction Factors
## Plotting shrinkage of Potential Scale Reduction Factors
## Plotting Number of effective independent draws
## Plotting Geweke Diagnostic
## Plotting caterpillar plot
## Time taken to generate the report: 7.3 seconds.

Inferential Findings

Let’s check whether the sampled values of the autoregressive parameters lie inside the stationarity region:

plot(jags.1$BUGSoutput$sims.array[, 1, 3:4])  # draws of phi[1] vs phi[2] from chain 1 (parameters are stored in alphabetical order)

Point and Interval Estimation

We can easily access the posterior mean and median of the parameters. Moreover, we can also look at the values for the 95% credible intervals.

##                  mean         sd          2.5%          25%          50%
## alpha      -0.1929914 0.17041526   -0.53071344   -0.3046283   -0.1917961
## deviance 1233.9121745 2.84292566 1230.47353082 1231.7776053 1233.2358233
## phi[1]      0.5781155 0.06133645    0.45634555    0.5373586    0.5775741
## phi[2]      0.1836100 0.06163908    0.05864183    0.1429859    0.1838248
## sigma       2.7010551 0.12011944    2.47185826    2.6174854    2.6960791
##                    75%        97.5%     Rhat n.eff
## alpha      -0.07790053    0.1420972 1.000670  3000
## deviance 1235.31521349 1241.2182542 1.002361  1100
## phi[1]      0.61794091    0.7014968 1.002583   960
## phi[2]      0.22476071    0.3011408 1.000797  3000
## sigma       2.78141278    2.9453524 1.000607  3000

We can then build credible intervals like this:

# Extract the posterior draws from the fitted model (used throughout below)
sims <- jags.1$BUGSoutput$sims.matrix
phi1_samples <- sims[, "phi[1]"]; phi2_samples <- sims[, "phi[2]"]
alpha_samples <- sims[, "alpha"]; sigma_samples <- sims[, "sigma"]
ci_phi1_95 <- quantile(phi1_samples, c(0.025, 0.975))
ci_phi2_95 <- quantile(phi2_samples, c(0.025, 0.975))
ci_alpha_95 <- quantile(alpha_samples, c(0.025, 0.975))
ci_sigma_95 <- quantile(sigma_samples, c(0.025, 0.975))

# Print credible interval
cat("95% credible interval for phi[1]: ", ci_phi1_95, "\n",
    "95% credible interval for phi[2]: ", ci_phi2_95, "\n",
    "95% credible interval for alpha: ", ci_alpha_95, "\n",
    "95% credible interval for sigma: ", ci_sigma_95, "\n")
## 95% credible interval for phi[1]:  0.4563455 0.7014968 
##  95% credible interval for phi[2]:  0.05864183 0.3011408 
##  95% credible interval for alpha:  -0.5307134 0.1420972 
##  95% credible interval for sigma:  2.471858 2.945352

We can also compute the Highest Posterior Density intervals:

phi1_mcmc <- as.mcmc(phi1_samples)
phi2_mcmc <- as.mcmc(phi2_samples)
alpha_mcmc <- as.mcmc(alpha_samples)
sigma_mcmc <- as.mcmc(sigma_samples)

# Calculate HPD intervals
hpd_phi1 <- HPDinterval(phi1_mcmc, prob = 0.95)
hpd_phi2 <- HPDinterval(phi2_mcmc, prob = 0.95)
hpd_alpha <- HPDinterval(alpha_mcmc, prob = 0.95)
hpd_sigma <- HPDinterval(sigma_mcmc, prob = 0.95)

# Print HPD intervals
cat("95% HPD interval for phi[1]: ", hpd_phi1, "\n",
    "95% HPD interval for phi[2]: ", hpd_phi2, "\n",
    "95% HPD interval for alpha: ", hpd_alpha, "\n",
    "95% HPD interval for sigma: ", hpd_sigma, "\n")
## 95% HPD interval for phi[1]:  0.4563488 0.7015954 
##  95% HPD interval for phi[2]:  0.06749437 0.3068271 
##  95% HPD interval for alpha:  -0.5364906 0.1322612 
##  95% HPD interval for sigma:  2.464203 2.93622

We then plot the credible intervals to gain a better understanding:

library(bayesplot)  # provides mcmc_areas
mcmc_areas(jags.1$BUGSoutput$sims.matrix, pars = c("phi[1]", "phi[2]", "alpha"), prob = 0.95, point_est = "mean")

It can also be useful to look at the posterior distributions of the parameters to get a complete picture of the uncertainty associated with each parameter.

Hypothesis testing

We can now perform hypothesis testing on the parameters using the credible intervals and the samples from the posterior distribution. In particular, we just have to check whether the point null value θ0 is inside the 95% credible interval we computed before. If it is inside, we decide in favor of H0; otherwise we reject it.

null_value <- 0

if (null_value >= ci_phi1_95[1] && null_value <= ci_phi1_95[2]) {
  cat("Null hypothesis cannot be rejected.\n")
} else {
  cat("Null hypothesis is rejected.\n")
}
## Null hypothesis is rejected.

By looking at the distributions and the interval estimates we can conclude that all parameters are significantly different from zero. Moreover, we see that the first lag has a stronger influence than the second: roughly 58% of the value of the first lag is carried forward to Yt, against only about 18% of the second.

This implies that the time series has short-term memory, with past values influencing future values but diminishing over time.

Model comparisons

If, instead, we were to use a classical / frequentist AR model, we would only get point estimates for the parameters, while as we have seen the Bayesian way provides posterior distributions for parameters, reflecting the uncertainty about their values.

We show this difference by fitting a frequentist AR model and comparing its parameter estimates with the Bayesian ones.

ar_f <- ar(y_tr, aic = FALSE, order.max = 2)
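
The frequentist point estimates can be inspected directly; ar() stores the fitted coefficients in the $ar field and the estimated series mean in $x.mean:

ar_f$ar       # AR coefficient estimates, to compare with the posterior means of phi
ar_f$x.mean   # estimated mean of the series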

We notice that the mean of the posterior distribution of the coefficients is very close to the frequentist estimate.

Is the AR(2) model really the best choice, as suggested by the PACF plot? To be sure, we can fit an AR(3) model and then compare the DICs to see which model is better. The one with the lowest DIC is the one we should choose.

real_data <- list(Y = y_tr, T = T, p = 3)

inits <- list(
  list(alpha = 1, phi =  c(1, 1, 1), sigma = 0.1), 
  list(alpha = 0.5, phi = c(0.5, 0.5, 0.5),  sigma = 0.5),
  list(alpha = 0, phi = c(1, 0.5, 1),  sigma = 1)
)

jags.2 <- jags(data=real_data,
               inits=inits,
               model.file="model_ar_v2.txt",
               parameters.to.save = params,
               n.chains=3,
               n.iter=10000,
               n.burnin=2000,
               quiet = TRUE)

print(jags.2)
## Inference for Bugs model at "model_ar_v2.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 2000 discarded), n.thin = 8
##  n.sims = 3000 iterations saved
##           mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat
## alpha      -0.236   0.175   -0.586   -0.355   -0.235   -0.117    0.103 1.001
## phi[1]      0.625   0.062    0.503    0.582    0.627    0.669    0.740 1.001
## phi[2]      0.311   0.071    0.170    0.265    0.312    0.358    0.450 1.001
## phi[3]     -0.221   0.061   -0.341   -0.262   -0.222   -0.181   -0.100 1.001
## sigma       2.640   0.119    2.422    2.557    2.636    2.714    2.885 1.001
## deviance 1217.157   3.260 1212.894 1214.771 1216.393 1218.891 1225.216 1.001
##          n.eff
## alpha     3000
## phi[1]    3000
## phi[2]    3000
## phi[3]    3000
## sigma     3000
## deviance  2800
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 5.3 and DIC = 1222.5
## DIC is an estimate of expected predictive error (lower deviance is better).
print(jags.1$BUGSoutput$DIC)
## [1] 1237.948

Indeed, the DIC is higher for the AR(3) model which means the AR(2) should be preferred.

DIC stands for Deviance Information Criterion and is a measure used for model selection and comparison. It balances the fit of the model (captured by the deviance) with its complexity (captured by the effective number of parameters).

It is computed as

$$\mathrm{DIC} = D(\bar{\theta}) + 2 p_D$$

The deviance is a measure of goodness of fit of the model to the data. It is defined as:

$$D(\theta) = -2 \log p(y \mid \theta) + C$$

where p(y|θ) is the likelihood function. The lower the deviance, the better the fit.

The effective number of parameters $p_D = \tfrac{1}{2}\mathrm{Var}[D(\theta)]$ accounts for the complexity of the model. The larger it is, the easier it is for the model to fit (possibly overfit) the data.
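
Using this rule, the DIC reported above can be recomputed by hand from the saved deviance draws (a minimal sketch; R2jags computes DIC as the mean deviance plus $p_D$, which is equivalent to the formula above when $p_D = \bar{D} - D(\bar{\theta})$):

dev_draws <- jags.1$BUGSoutput$sims.list$deviance
pD <- var(dev_draws) / 2    # effective number of parameters, "the rule" above
mean(dev_draws) + pD        # DIC, matching jags.1$BUGSoutput$DIC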

Model Checking and Posterior Predictive Distribution

To assess the adequacy of the chosen model in capturing the patterns and variability present in the observed data we will run a posterior predictive check.

By sampling parameters from the posterior distribution and using them to generate possible paths of the time series, we can build a distribution of outcomes. This allows us to account for parameter uncertainty. Formally, the posterior predictive distribution for a new observation ynew given the observed data y is defined as:

$$P(y_{\text{new}} \mid y) = \int P(y_{\text{new}} \mid \theta)\, P(\theta \mid y)\, d\theta$$

This integral is approximated by using MCMC samples from the posterior distribution.

In-Sample

n_sim <- jags.1$BUGSoutput$n.sims
n_obs <- length(y_tr)

# Initialize matrix for simulated data
Y_sim <- matrix(NA, nrow = n_sim, ncol = n_obs)
resid_sim <- matrix(NA, nrow = n_sim, ncol = n_obs)

# Simulate new data
for (i in 1:n_sim) {
  alpha <- alpha_samples[i]
  phi1 <- phi1_samples[i]
  phi2 <- phi2_samples[i]
  sigma <- sigma_samples[i]
  
  Y_sim[i, 1:2] <- y_tr[1:2]
  resid_sim[i, 1:2] <- c(0, 0)
  for (t in 3:n_obs) {
    Y_sim[i, t] <- alpha + phi1 * Y_sim[i, t-1] + phi2 * Y_sim[i, t-2] + rnorm(1, 0, sigma)
    resid_sim[i, t] <- y_tr[t] - Y_sim[i, t]
  }
}
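
One way to summarize this check visually (a sketch; the report's original figures are not reproduced here) is to overlay the pointwise posterior predictive mean and a 95% band on the observed training series:

ppc_mean <- apply(Y_sim, 2, mean)                     # pointwise predictive mean
ppc_lo   <- apply(Y_sim, 2, quantile, probs = 0.025)  # lower bound of 95% band
ppc_hi   <- apply(Y_sim, 2, quantile, probs = 0.975)  # upper bound of 95% band
plot(as.vector(y_tr), type = 'l', main = "In-Sample Posterior Predictive Check",
     xlab = "Time", ylab = "Value")
lines(ppc_mean, col = 'red')
lines(ppc_lo, col = 'blue', lty = 2)
lines(ppc_hi, col = 'blue', lty = 2)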

Out-of-Sample

Multiple steps ahead forecasting

We can generate forecasts for multiple future time steps by iteratively using the posterior samples of the model parameters.

# Number of future time steps to forecast
forecast_steps <- l

total_samples <- jags.1$BUGSoutput$n.sims
n_samples_to_use <- 2000

# Randomly select a subset of samples
set.seed(123)
subset_indices <- sample(1:total_samples, n_samples_to_use)

# Initialize a matrix to store forecast values
forecast_values <- matrix(NA_real_, nrow = forecast_steps + 2, ncol = n_samples_to_use)
forecast_values[1, ] = y_tr[length(y_tr) - 1]
forecast_values[2, ] = y_tr[length(y_tr)]

residuals <- matrix(NA, nrow = forecast_steps + 2, ncol = n_samples_to_use)

# Generate forecast values for each posterior sample
for (s in 1:n_samples_to_use) {
  alpha <- alpha_samples[s]
  phi1 <- phi1_samples[s]
  phi2 <- phi2_samples[s]
  sigma <- sigma_samples[s]
  
  for (row in (p+1):(forecast_steps + 2)) {
    noise <- rnorm(1, 0, sigma) 
    forecast_values[row, s] <- alpha + phi1 * forecast_values[row - 1, s] + phi2 * forecast_values[row - 2, s] + noise
    residuals[row, s] <- y_te[row-2] - forecast_values[row, s]
  }
}

#Median and mean
median_forecast <- apply(forecast_values, 1, median)
mean_forecast <- apply(forecast_values, 1, mean)

# Credible interval
lower_quantile <- apply(forecast_values[2:(forecast_steps + 2), ], 1, quantile, probs = 0.025)
upper_quantile <- apply(forecast_values[2:(forecast_steps + 2), ], 1, quantile, probs = 0.975)

At each forecast step we don’t get a single value, but a distribution of possible outcomes.

We can also compute the residuals and check the model's distributional assumptions using appropriate tests and plots.
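
Judging from the output below, the test for the first forecast step was run on the third row of the residuals matrix (the first two rows only hold the conditioning values):

shapiro.test(residuals[3, ])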

## Shapiro-Wilk normality test for residuals of forecast 1:
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals[3, ]
## W = 0.99945, p-value = 0.8564

If we want a point forecast we can take, for example, the mean of the distribution of forecasts at each step. If we do this, we will notice that the values are very close to the ones we would obtain with a frequentist AR(2) model.

We can check the overall goodness of the forecasts and compare it to that of a frequentist approach.
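
A sketch of how such metrics can be computed (the report's exact code is not shown; the indexing assumes, as above, that the first two rows of forecast_values hold the last two observed points):

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mape <- function(actual, pred) 100 * mean(abs((actual - pred) / actual))
rmse(y_te, mean_forecast[3:(forecast_steps + 2)])   # Bayesian point forecasts
mape(y_te, mean_forecast[3:(forecast_steps + 2)])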

## Frequentist AR Model RMSE and MAPE: 1.88934 199.8737 
## Bayesian AR Model RMSE and MAPE: 1.961034 174.6096

Forecasting using MCMC

If we set some of the values in our data list to the value “NA” JAGS will treat these missing data as additional parameters to be estimated. This is especially useful for time series as we can create extra values at the end of our series, and JAGS will magically turn these into future forecasts. We are essentially estimating the joint posterior predictive distribution for future/missing observations using MCMC.

n_forecasts <- l

p <- 2
T <- length(y_tr) + n_forecasts

real_data <- list(Y = c(y_tr, rep(NA, n_forecasts)),
                  T = T, p = p)

params <- c("Y")


jags_fore <- jags(data=real_data,
               model.file="model_ar_v1.txt",
               parameters.to.save = params,
               n.chains=3,
               n.iter=10000,
               n.burnin=2000,
               quiet = TRUE)

y_all <- jags_fore$BUGSoutput$sims.list$Y
y_all_mean <- apply(y_all, 2, mean)  # posterior mean at each time point (observed + forecast)

If we want to go back to the original scale we can simply apply the Inverse Box-Cox transform to our predictions and add back the trend and the seasonality.

preds <- predict(ar_f, n.ahead = l)  # frequentist forecasts, used below for comparison (assumed construction)
inv_b <- forecast::InvBoxCox(y_all_mean[(length(y) - l - 1):length(y)], lambda = lambda)
inv_b <- inv_b + as.vector(decomp$seasonal)[(length(y) - l - 1):length(y)] + as.vector(na.approx(decomp$trend, rule = 2))[(length(y) - l - 1):length(y)]
inv_fr <- forecast::InvBoxCox(preds$pred, lambda)
inv_fr <- inv_fr + as.vector(decomp$seasonal)[(length(y) - l + 1):length(y)] + as.vector(na.approx(decomp$trend, rule = 2))[(length(y) - l + 1):length(y)]

One-step-ahead Forecasts

We can also make one-step-ahead predictions using the posterior means of the parameters as coefficients.

alpha_m <- jags.1$BUGSoutput$mean$alpha
phi1_m <- jags.1$BUGSoutput$mean$phi[1]
phi2_m <- jags.1$BUGSoutput$mean$phi[2]

t2 <- c(NA, y[1:(length(y) - 2)])  # series lagged by two steps, aligned with t = 2, ..., T
fitted_values <- alpha_m + phi1_m * y[1:(length(y) - 1)] + phi2_m * t2  # one-step-ahead fits


# Plot actual vs fitted values
plot(as.vector(y), type='l', main = "One-step-ahead Forecast", 
     xlab = 'Time', ylab = 'Value', col = 'black', lwd = 2)

# Add fitted values
lines(2:length(y), fitted_values, col = 'red', lwd = 1.5)

legend("topright", legend = c("Actual", "Forecast"), 
       col = c("black", "red"), lwd = 2, lty = 1, cex = 1.2)

grid()