Introduction
Replicate results from 3.1 Motor Vehicles and Parts
- Replication with authors’ data
- Replication with raw data
Extensions
Conclusion

Introduction

We replicated some of the results from paper Predicting the Present with Google Trends.. We focused on 3.1 Motor Vehicles and Parts and were able to replicate everything in that session using original data. We also explored some extensions of the session, such as replicate the results with data from different time period, evaluate different models and forecast. We found a model that that can forecast sales for a whole year by using only previous years data. We tested the model and the \(R^2\) ranged from 0.88 to 0.97 for years 2014-2019. We made our predictions for the year of 2020. A flexdashboard that helped us evaluate and visualize different models was developed.

Replicate results from 3.1 Motor Vehicles and Parts

Replication with authors’ data

Replicate baseline seasonal AR-1 model

We began the replication process by using the data that was used for the paper. According to the paper, the model follows the form of \(y_t = b_1y_{t-1}+b_{12}y_{t-12}+e_t\)(AR-1 model). We used the codes below to obtain the same summary table of the model.

# load the data that was used for the paper
merged <- read_csv("merged.csv")

# take the log of all sales values
merged$sales<-log(merged$sales)

# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = merged, sales~lag(sales, 1)+lag(sales,12))

#the summary of the model
summary(model1)

## 
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12), data = merged)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.209554 -0.034684  0.002482  0.040477  0.220976 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.67266    0.76355   0.881 0.381117    
## lag(sales, 1)   0.64345    0.07332   8.776 3.59e-13 ***
## lag(sales, 12)  0.29565    0.07282   4.060 0.000118 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07985 on 76 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.7185, Adjusted R-squared:  0.7111 
## F-statistic:    97 on 2 and 76 DF,  p-value: < 2.2e-16

Replicate the model with trends

According to the paper, we added two trends, Trucks & SUVs and Automotive Insurance to the previous model. Once again, the summary we obtained is identical to the paper

# model with trends
model_with_trend <- lm(data = merged, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
# model.matrix(lm(data = merged, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance))

# summary table of the model
summary(model_with_trend)

## 
## Call:
## lm(formula = sales ~ lag(sales, 1) + lag(sales, 12) + suvs + 
##     insurance, data = merged)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.161327 -0.043774  0.002998  0.036651  0.159219 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.45798    0.78438  -0.584 0.561081    
## lag(sales, 1)   0.61947    0.06318   9.805 5.09e-15 ***
## lag(sales, 12)  0.42865    0.06535   6.559 6.45e-09 ***
## suvs            1.05721    0.16686   6.336 1.66e-08 ***
## insurance      -0.52966    0.15206  -3.483 0.000835 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06509 on 74 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.8179, Adjusted R-squared:  0.808 
## F-statistic: 83.08 on 4 and 74 DF,  p-value: < 2.2e-16

Replicate figure 2

The authors used rolling window forecast to confirm that the Google Trends data improves out-of-sample forecasting. What is rolling window forecast? That’s say now is July, 2005. We would use all the actual data that is available to train a model and predict August, 2005. We’ll keep predicted value aside. Imagine now is August, 2005, we have all the actual data(up to July, 2005) available to train another model to predict September, 2005. Repeat this process for Base model and Trends model.

A self-defined function for rolling window forecasting

rolling_window <- function(begin, start, end, model,data_set){
  all_data <- data_set %>% filter(Period >=begin & Period <= end)
  p <- data_set %>% filter(Period >=begin & Period <= end)
  s <- which(grepl(start, p$Period))
  e <- which(grepl(end, p$Period))
  for (i in s:e){
    train_data <- all_data[1:i-1,]
    model1 <- lm(data = train_data, as.formula(model))
    p$sales[i] <- predict(model1,all_data[1:i,])[i]
  }
p[s:e,]
}

Compute the numbers for plotting with the function

# beginning date
begin <-"2004-01-01"
# starting date for rolling-window-forecasting
start <- "2005-06-01"
# end date
end <- "2011-07-01"

# rolling window forecasting with base model
base <- rolling_window(begin, start, end, sales ~ lag(sales, 1)+lag(sales,12), merged) %>% 
  mutate(label = "base")

# rolling window forecasting with trends model
trends <- rolling_window(begin, start, end, sales ~ lag(sales, 1)+lag(sales,12) + insurance + suvs, merged) %>% 
  mutate(label = "trends")

# actual data
actual <- merged %>% filter(Period >= start & Period <= end) %>% 
  mutate(label ="actual")
# combine all the data
plot_data <- rbind(actual, base, trends)

Plot figure 2

ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index')+
  labs(subtitle ="MAE improvement: Overall = 10.66%  During recession = 21.46%" )

MSE

# MSE for Trends model
mean(abs(trends$sales-actual$sales))

## [1] 0.05667658

# MSE for Base model 
mean(abs(base$sales-actual$sales))

## [1] 0.06343984

\(R^2\)

# R^2 for base
(cor(base$sales,actual$sales))^2

## [1] 0.7241049

# R^2 for trends
(cor(trends$sales,actual$sales))^2

## [1] 0.7928897

MSE for recession period

# data for the recession period Dec 2007 to June 2009
recession_trends <- trends %>%
  filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base <- base %>%
  filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual <- actual %>%
  filter(Period>="2007-12-01"& Period<="2009-06-01")

# means absolute error for the recession period
# MAE for the model with trends
mean(abs(recession_trends$sales-recession_actual$sales))

## [1] 0.06965812

# MAE for the model for baseline model
mean(abs(recession_base$sales-recession_actual$sales))

## [1] 0.08869325

Overall Improvement

(mean(abs(base$sales-actual$sales))-mean(abs(trends$sales-actual$sales)))/mean(abs(base$sales-actual$sales))

## [1] 0.1066089

Recession improvement

(mean(abs(recession_base$sales-recession_actual$sales))-mean(abs(recession_trends$sales-recession_actual$sales)))/mean(abs(recession_base$sales-recession_actual$sales))

## [1] 0.2146175

Note: the improvements they stated in the paragraph were different from what they labeled on the graph. First, I thought they used some kind of function, but the MAE() from library(caret) gives same value. Maybe a typo.

Replication with raw data

Where did we find the data?

The sales data: We first tried the link provided in the paper, but the data is adjusted. We were able to find the unadjusted data through this link. However, the values were different from the original data(used by the authors of the paper). The difference increases as year increases. The reason for that is still unclear. We will compare the two data set in the next session.

The search data: Trucks & SUVs and Auto Insurance

Comparing sales data used in the paper with data we obtained from census.gov

We noticed that the sales values from the data we obtained was always less than the data used in the paper. We plotted both data on the same graph.

we can see that the sales values from both original and unajusted data are very close to each other(almost overlapping).

ggplot(com_data, aes(x=Period, y = log(sales), color = label)) +
  geom_line()

When we plotted original data against unadjusted data, they almost lined up on the diagonal. Note that original is always less than unadjusted. I wonder if people went back and modified that data after the authors obtained the data. Further investigation is needed. In conclusion, the unadjusted data we found is very close to the original data.

ggplot(joined_data, aes(x = sales, y = unadjusted_sales)) +
  geom_point()+
  geom_abline(linetype = "dashed") +
  xlab('original') +
  ylab('unadjusted')+
  ggtitle("original sales against unadjusted sales")

Attempts to find the transformation on Google Trends data in original data

The authors did not specify what kind of transformation/normalization/standardization they performed on Google Trends data. Below are what we have tried to transform trends data.

new_value = (old_value-first_observed_value)/max(old_value)
Min-Max Normalization
Use scale()

None of these match with the original data. Therefore, we decided to process without any transformation on Google Trends data.

We also tried other key words like “Buy car”, “car dealer”, “Vehicle shop” and so on, but none of these were significant to the model

Replicate the results again with unadjusted data from census.gov with trends data from original paper

We would like to compare the outputs of unadjusted data with the outputs from the original data with same trends data. We repeated similar steps as before and obtained a graph similar to figure 2.

ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black","black", "red","red","grey","grey"))+
  scale_linetype_manual(values = c("solid","solid","dashed", "dashed", "solid","solid"))+
  ylab('log(mvp)')+
  xlab('Index')

From the graph, we see that lines are almost overlapped, and \(R^2\) are very close to the ones from original data, 0.7266 for the base and 0.7911028 for the one with trends.

Replicate the results with unadjusted data from census.gov and search data from Google Trends without unknown transformations

The current month’s trends data was significant to the model even without the transformation during the time period 01/01/2004 to 07/01/2011. We used current month’s trends data as an estimator of the first week of the month since weekly data was not available.

Clean the data for modeling

# trend 1 
# search data in category Trucks & SUVs between 01/01/2004 and 07/01/2011
suvs_trends <- read_csv("New_Trucks_suvs.csv")
names1<- names(suvs_trends)
suvs_trends <- suvs_trends %>%
  rename(suvs = names1[2],
         Period = Month)

# trend 2
# search data in category Auto Insurance between 01/01/2004 and 07/01/2011
insurance_trends <- read_csv("New_auto_insurance.csv")
names2 <- names(insurance_trends)
insurance_trends <- insurance_trends %>%
  rename(insurance = names2[2],
         Period = Month)

# census data 
unadj_full <- sales %>% filter(Period <= "2011-07-01")

#join trends data
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m")))

# join all data
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value) 

# take the log of sales
with_trends_full$sales = log(as.numeric(with_trends_full$sales))

Summary of Base model with raw data

model1 <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12))

m1 <- glance(model1)

tidy(model1)

Summary of Trends model with raw data (Trends data from previous month)

model_with_trend_pre_month <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) +lag(suvs,1) + lag( insurance,1))

mtp<-glance(model_with_trend_pre_month)

tidy(model_with_trend_pre_month)

Summary of Trends model with raw data (Trends data from current month)

# model with current month's trends(we don't have weekly data, so we are using the monthly data instead)
model_with_trend_current <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)

mtc<-glance(model_with_trend_current)

tidy(model_with_trend_current)

The \(R^2\) for base model is 0.7212228, the \(R^2\) for the model with trends data from previous month is 0.7402116, and the \(R^2\) for the model with trends data from current month is 0.7830012. We see an improvement in overall fitness when trends data were from current month (since we don’t have access to the trends data of the first week of the month, we are using the monthly data as an estimator). We will further investigate if this is true with out-of-sample forecasting.

Evaluate the model by using rolling window forecast starting from 2005-06-01(with trends data from current month)

We evaluated the model by out-of-sample forecasting, and we found that the trends data made the predictions less accurate.

Compute the numbers for plotting

begin <-"2004-01-01"
start <- "2005-06-01"
end <- "2011-07-01"

# rolling window forecasting with base model
base <- rolling_window(begin, start, end, sales ~ lag(sales, 1)+lag(sales,12), with_trends_full) %>% 
  mutate(label = "base")

# rolling window forecasting with trends model
trends <- rolling_window(begin, start, end, sales ~ lag(sales, 1)+lag(sales,12) + insurance + suvs, with_trends_full) %>% 
  mutate(label = "trends")

# actual data
actual <- with_trends_full %>% filter(Period >= start & Period <= end) %>% 
  mutate(label ="actual")
# combine all the data
plot_data <- rbind(actual, base, trends)

Plot figure 2 with raw data

ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))

MSE

# MAE_base 
mean(abs(base$sales-actual$sales))

## [1] 0.06365923

# MAE_trends
mean(abs(trends$sales-actual$sales))

## [1] 0.06687234

\(R^2\)

#R^2 for base
(cor(base$sales,actual$sales))^2

## [1] 0.7266

# R^2 for trends
(cor(trends$sales,actual$sales))^2

## [1] 0.7360671

MSE during the recession period

recession_trends <- trends %>% 
  filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base <- base %>% 
  filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual <- actual %>% 
  filter(Period>="2007-12-01"& Period<="2009-06-01")

# MAE_recession_trends
MAE_recession_trends <- mean(abs(recession_trends$sales-recession_actual$sales))
MAE_recession_trends

## [1] 0.0679496

# MAE_recession_base 
MAE_recession_base <- mean(abs(recession_base$sales-recession_actual$sales))
MAE_recession_base

## [1] 0.08903226

Overall improvement

overall_improv <- (mean(abs(base$sales-actual$sales))-mean(abs(trends$sales-actual$sales)))/mean(abs(base$sales-actual$sales))
overall_improv

## [1] -0.05047349

Improvement during the recession

recession_improv <- (mean(abs(recession_base$sales-recession_actual$sales))-mean(abs(recession_trends$sales-recession_actual$sales)))/mean(abs(recession_base$sales-recession_actual$sales))
recession_improv

## [1] 0.236798

The MAE for base model is 0.0636592, and the MAE for the model with trends is 0.0668723. The overall improvement is -5.0473494%. That means adding trends to the model make things worse. We see a totally different story when focus on recession period. The MAE for base model is 0.0890323, and the MAE for the model with trends is 0.0679496 during the recession. There is 23.6798011% improvement after adding trends to the model.

Extensions

Replication with data between 08/01/2011- 12/01/2019

The model produced higher \(R^2\) for time period 08/01/2011 to 12/01/2019. However, the trends data did not improve the performance.

Clean data

suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>% 
  rename(suvs = names1[2],
         Period = Month)
insurance_trends <- insurance_trends %>% 
  rename(insurance = names2[2], 
         Period = Month)

start_month <- "2011-08-01" # "2004-01-01"
end_month <- "2019-12-01"   # "2020-05-01"

unadj_full <- sales %>% filter(Period <= end_month & Period >=start_month)
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m"))) 
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value) 
with_trends_full$sales = log(as.numeric(with_trends_full$sales))

Summary of Base model

# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12))

tidy(model1)

Summary of Trends model

# Use trends data from current as an estimator for trends data from first week of the month
model_with_trend <- lm(data = with_trends_full, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)

tidy(model_with_trend)

Plot Figure 2

ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))

For data between 2013-01-01 and 2019-12-01. We don’t see much improvement on the fitness of the model when trends data was included. When we applied rolling window forecasting, we got \(R^2\) = 0.9375906 for baseline model, and \(R^2\) = 0.9168464 for the model with trends data. This is suggesting that adding trends data to the model may decreases prediction accuracy. We also tried this procedure with data from different time period, the results are similar.

Forecasting with new model

We would like to explore new models for forecasting. We examined two models. At the end, We tried to predict the sales of July, 2020 using these models.

Model 1: \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\)

We wanted to forecast by using the baseline model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\) and the model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + b_3insurance + b_4SUVs + e_t\) (search data from month t). We used rolling window forecast to evaluate our model(from 01/01/2017 to 01/01/2019) .

Clean data

## getting data ready
suvs_trends <- read_csv("trucks_suvs_2004_2020.csv")
insurance_trends <- read_csv("auto_insurance_2004_2020.csv")
# rename
names1<- names(suvs_trends)
names2 <- names(insurance_trends)
suvs_trends <- suvs_trends %>% 
  rename(suvs = names1[2],
         Period = Month)

insurance_trends <- insurance_trends %>% 
  rename(insurance = names2[2], 
         Period = Month)
unadj_full <- sales %>% filter(Period <= "2020-05-01")


##Join data
trends_full <- left_join(insurance_trends, suvs_trends, by = "Period") %>% mutate(Period = as.Date(as.yearmon(Period, "%Y-%m"))) 
with_trends_full <- left_join(unadj_full, trends_full, by = "Period") %>% rename(sales = Value) 
with_trends_full$sales = log(as.numeric(with_trends_full$sales))
#with_trends_full$sales = as.numeric(with_trends_full$sales)

Computing the numbers for plotting

begin <-"2013-01-01"
start <- "2017-01-01"
end <- "2019-01-01"

# rolling window forecasting with base model
base <- rolling_window(begin, start, end, sales~lag(sales, 2)+lag(sales,14), with_trends_full) %>% 
  mutate(label = "base")

# rolling window forecasting with trends model
trends <- rolling_window(begin, start, end, sales~lag(sales, 2)+lag(sales,14)+ lag(suvs,2) + lag(insurance,2), with_trends_full) %>% 
  mutate(label = "trends")

# actual data
actual <- with_trends_full %>% filter(Period >= start & Period <= end) %>% 
  mutate(label ="actual")
# combine all the data
plot_data <- rbind(actual, base, trends)

Plot figure 2

ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))

MSE

# MAE for trends 
MAE(trends$sales, actual$sales)

## [1] 0.06163394

#MAE for baseline model
MAE(base$sales, actual$sales)

## [1] 0.06326213

\(R^2\)

#R^2 for baseline model
(cor(base$sales,actual$sales))^2

## [1] 0.008848728

# R^2 for trends
(cor(trends$sales,actual$sales))^2

## [1] 0.0316904

Overall Improvement

(MAE(base$sales, actual$sales)-MAE(trends$sales, actual$sales))/MAE(base$sales, actual$sales)

## [1] 0.02573714

Both models performed badly. We were curious about the reason. We plotted Period VS sales for data from 01/01/2004 through 05/01/2020.

Plot Sales

# yearly trends
ggplotly(ggplot(with_trends_full, aes(x = Period, y = exp(sales))) + geom_line())

From this graph, we observed annual trend. The beginning and the end of a year tend to be the lowest point, mid-year tends to be the highest point, and the overall sales is increasing (exceptions during the recession and 2020). This explains why the model \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\) performed poorly. In this model, we used data from current month and 12 months ago to predict what would happen two month later. This violates the annual trend the sales data obeys.

Model 2: \(y_t=b_1y_{t-12}+b_4y_{t-48}+b_5insurance + b_6SUVs\)

Inspired by the annual trend, we want to test a different model for forecasting. \(y_t=b_1y_{t-12}+b_2y_{t-24}+b_3y_{t-36}+b_4y_{t-48}+b_5insurance + b_6SUVs\) Since there is a strong annual trend, we wonder if use the data from previous years of the same month as a predictors will give us a better model.

we used data from 2012-01-01 through 2018-12-01 to train the model, and predict what would the sales be for the whole year of 2019.

Note: we tried to use data from different time periods as training data for the model. It appears that use data from previous 6 years to train the model is the best.

Model 2.1 \(y_t=b_1y_{t-12}+b_2y_{t-24}+b_3y_{t-36}+b_4y_{t-48}+b_5insurance + b_6SUVs\)

Summary of Base model

# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48))

tidy(model1)

Summary of Trends model

# Use trends data from current as an estimator for trends data from first week of the month
model_with_trend <- lm(data = with_trends_full, sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

tidy(model_with_trend)

Clean data

begin <-"2004-01-01"
start <- "2010-01-01"
end <- "2020-05-01"

# rolling window forecasting with base model
base <- rolling_window(begin, start, end, sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48), with_trends_full) %>% 
  mutate(label = "base")

# rolling window forecasting with trends model
trends <- rolling_window(begin, start, end, sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12), with_trends_full) %>% 
  mutate(label = "trends")

# actual data
actual <- with_trends_full %>% filter(Period >= start & Period <= end) %>% 
  mutate(label ="actual")
# combine all the data
plot_data <- rbind(actual, base, trends)
#R^2 for baseline model
rsb <- paste("Base: R^2 = ", round((cor(base$sales,actual$sales))^2,digits = 4))
# R^2 for trends
rst <- paste("Trend: R^2 = ", round((cor(trends$sales,actual$sales))^2,digits = 4))

Plot figure 2

ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
    ggtitle(paste0(rsb,"\n",rst))+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))

Model 2.2 \(y_t=b_1y_{t-12}+b_4y_{t-48}+b_5insurance + b_6SUVs\)

Based on the summary of Model 2.1, we wanted to test the performance of a new model without data from two and three years ago.

Summary of Base model

# apply lm(). lag() is used to capture y_t-1 and y_t-12
model1 <- lm(data = with_trends_full, sales~lag(sales, 12)+lag(sales, 48))

tidy(model1)

Summary of Trends model

# Use trends data from current as an estimator for trends data from first week of the month
model_with_trend <- lm(data = with_trends_full, sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

tidy(model_with_trend)

Clean data

begin <-"2004-01-01"
start <- "2010-01-01"
end <- "2020-05-01"

trends_model <- sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12)
base_model <- sales~lag(sales, 12)+lag(sales, 48)
# rolling window forecasting with base model
base <- rolling_window(begin, start, end, base_model, with_trends_full) %>% 
  mutate(label = "base")

# rolling window forecasting with trends model
trends <- rolling_window(begin, start, end, trends_model, with_trends_full) %>% 
  mutate(label = "trends")

# actual data
actual <- with_trends_full %>% filter(Period >= start & Period <= end) %>% 
  mutate(label ="actual")
# combine all the data
plot_data <- rbind(actual, base, trends)
#R^2 for baseline model
rsb <- paste("Base: R^2 = ", round((cor(base$sales,actual$sales))^2,digits = 4))
# R^2 for trends
rst <- paste("Trend: R^2 = ", round((cor(trends$sales,actual$sales))^2,digits = 4))

Plot figure 2

ggplotly(ggplot(plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
    ggtitle(paste0(rsb,"\n",rst))+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(mvp)')+
  xlab('Index'))

Since the \(R^2\) is higher for Model 2.2, we believed that data from one and four years ago were better predictors. We also developed a flexdashboard that evaluates and visualizes different models related for the data set.

Predict sales of 2019 using Model 2.2

# Used data from 2012-01-01 through 2018-12-01 to train the model
training_data <- with_trends_full %>% 
  filter(Period >="2012-01-01"& Period <="2018-12-01")

# I'm using trends data from previous year of the same month for training
model_with_trend_all <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 24)+lag(sales, 36)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))
#summary(model_with_trend_all)

model_with_trend_2 <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

model_without_trend <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48))

# we want to predict whole year of 2019, so we need data from 2015-01-01 and after 

#Change here to see the graph of that model
model <- model_with_trend_2

test_data <- with_trends_full %>%
  filter(Period >="2015-01-01" & Period <= "2019-12-01")
predictions <- test_data %>%
  add_predictions(model) %>% 
  filter(!(is.na(pred)) )


MSE <- mean((predictions$pred -predictions$sales)^2)

ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
  geom_point() +
  geom_abline()+
  xlab("Sales")+
  ylab("Prediction"))

(cor(predictions$sales, predictions$pred))^2

## [1] 0.9379584

We tried three versions of this model, we found that model \(y_t=b_1y_{t-12}b_4y_{t-48}+b_5insurance + b_6SUVs\) makes better predictions for year 2019. MSE = 6.166375610^{-4} and \(R^2\) = 0.9379584. Since this model works especially well, I would like to know its performance on other years.

\(R^2\) = 0.9712654 for the predictions on 2015. The lowest \(R^2\) was 0.8812861 on 2017’s predictions for years from 2018-2014.

Test the model on 2018

# Used data from 2011-01-01 through 2017-12-01 to train the model
training_data <- with_trends_full %>% 
  filter(Period >="2011-01-01"& Period <="2017-12-01")

model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48))
            #+ lag(insurance,12) + lag(suvs,12))

test_data <- with_trends_full %>%
  filter(Period >="2014-01-01" & Period < "2019-01-01")

predictions <- test_data %>%
  add_predictions(model) %>% 
  filter(!(is.na(pred)) )

ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
  geom_point() +
  geom_abline()+
  xlab("Sales")+
  ylab("Prediction"))

(cor(predictions$sales, predictions$pred))^2

## [1] 0.9303015

Test the model on 2017

# Used data from 2010-01-01 through 2016-12-01 to train the model
training_data <- with_trends_full %>% 
  filter(Period >="2010-01-01"& Period <="2016-12-01")

model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

test_data <- with_trends_full %>%
  filter(Period >="2013-01-01" & Period < "2018-01-01")

predictions <- test_data %>%
  add_predictions(model) %>% 
  filter(!(is.na(pred)) )

ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
  geom_point() +
  geom_abline()+
  xlab("Sales")+
  ylab("Prediction"))

(cor(predictions$sales, predictions$pred))^2

## [1] 0.8812861

Test the model on 2016

# Used data from 2009-01-01 through 2015-12-01 to train the model
training_data <- with_trends_full %>% 
  filter(Period >="2009-01-01"& Period <="2015-12-01")

model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

test_data <- with_trends_full %>%
  filter(Period >="2012-01-01" & Period < "2017-01-01")

predictions <- test_data %>%
  add_predictions(model) %>% 
  filter(!(is.na(pred)) )

ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
  geom_point() +
  geom_abline()+
  xlab("Sales")+
  ylab("Prediction"))

(cor(predictions$sales, predictions$pred))^2

## [1] 0.9002892

Test the model on 2015

# Used data from 2008-01-01 through 2014-12-01 to train the model
training_data <- with_trends_full %>% 
  filter(Period >="2008-01-01"& Period <="2014-12-01")

model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

test_data <- with_trends_full %>%
  filter(Period >="2011-01-01" & Period <= "2015-12-01")

predictions <- test_data %>%
  add_predictions(model) %>% 
  filter(!(is.na(pred)) )

ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
  geom_point() +
  geom_abline()+
  xlab("Sales")+
  ylab("Prediction"))

(cor(predictions$sales, predictions$pred))^2

## [1] 0.9712654

Test the model on 2014

# Used data from 2007-01-01 through 2014-12-01 to train the model
training_data <- with_trends_full %>% 
  filter(Period >="2007-01-01"& Period <="2013-12-01")

model <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

test_data <- with_trends_full %>%
  filter(Period >="2010-01-01" & Period < "2017-01-01")

predictions <- test_data %>%
  add_predictions(model) %>% 
  filter(Period >= "2014-01-01" )

ggplotly(ggplot(predictions, aes(x=exp(predictions$sales), y = exp(predictions$pred)))+
  geom_point() +
  geom_abline()+
  xlab("Sales")+
  ylab("Prediction"))

(cor(predictions$sales, predictions$pred))^2

## [1] 0.916022

In conclusion, this model works well when predicting other years as well.

Predict June and July 2020 with various models

We want to predict June and/or July with the models we examined above. The tricky part is that how much data should we use to train the model. We believe that using more recent data would be more helpful. To decide what data we would use to train the model, we were changing the training data and test it by predicting the sales of 2019. When we are satisfied with the model, we use it to predict June and/or July. We don’t expect the prediction to be accurate since 2020 is such a special year

1. With model \(y_t = b_1y_{t-1}+b_12y_{t-12}+e_t\)(AR-1 model)

# We are making prediction of 2019(whole year) at the begining of 2019 when the data for 2018-12 is available.

# played with different start month, 2016 gives the higher R^2
start <- "2016-01-01"
end <- "2018-12-01"
training_data <- with_trends_full %>% 
  filter(Period >=start& Period <=end)
model1 <- lm(data = training_data, sales~lag(sales, 1)+lag(sales,12))
testing_data <- with_trends_full %>% filter(Period >="2018-01-01")# 12 month before 2019-01-01
# summary(model1)

predicted <- testing_data %>% 
  add_predictions(model1) %>% 
  filter(!(is.na(pred)))

MSE <- mean((predicted$sales -predicted$pred)^2)
(cor(predicted$sales, predicted$pred))^2

## [1] 0.07932483

n <- tidy(model1)


#prediction for June 2020 
sales_2020_05 <- as.numeric(with_trends_full %>% filter(Period == "2020-05-01") %>% select(sales))
sales_2019_06 <- as.numeric(with_trends_full %>% filter(Period == "2019-06-01") %>% select(sales))
June <- exp(n$estimate[1] +n$estimate[2]*sales_2020_05+n$estimate[3]*sales_2019_06)
June

## [1] 106580

# prediction for July 
sales_2019_07 <- as.numeric(with_trends_full %>% filter(Period == "2019-07-01") %>% select(sales))
July <-  exp(n$estimate[1] +n$estimate[2]*log(June)+n$estimate[3]*sales_2019_07)
July

## [1] 110945.9

2. With model \(y_t = b_1y_{t-2}+b_2y_{t-14}+e_t+b_3insurance+b_4SUVs\)

start <- "2016-01-01"
end <- "2018-12-01"
training_data <- with_trends_full %>% 
  filter(Period >=start& Period <=end)
model1 <- lm(data = training_data, sales~lag(sales, 2)+lag(sales,14) + insurance + suvs)
testing_data <- with_trends_full %>% filter(Period >="2017-10-01")# 14 month before 2019-01-01

summary(model1)

## 
## Call:
## lm(formula = sales ~ lag(sales, 2) + lag(sales, 14) + insurance + 
##     suvs, data = training_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.136168 -0.010870  0.003921  0.026308  0.078363 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    11.007491   2.006196   5.487 4.01e-05 ***
## lag(sales, 2)   0.570992   0.526575   1.084    0.293    
## lag(sales, 14) -0.574034   0.527727  -1.088    0.292    
## insurance       0.000679   0.003944   0.172    0.865    
## suvs            0.005389   0.003543   1.521    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05641 on 17 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.2592, Adjusted R-squared:  0.08492 
## F-statistic: 1.487 on 4 and 17 DF,  p-value: 0.25

predicted <- testing_data %>% 
  add_predictions(model1) %>% 
  filter(!(is.na(pred)))

MSE <- mean((predicted$sales -predicted$pred)^2)
(cor(predicted$sales, predicted$pred))^2

## [1] 0.09990959

n <- tidy(model1)


#prediction for June 2020 
sales_2020_05 <- as.numeric(with_trends_full %>% filter(Period == "2020-05-01") %>% select(sales))
sales_2019_05 <- as.numeric(with_trends_full %>% filter(Period == "2019-05-01") %>% select(sales))
June <- exp(n$estimate[1] +n$estimate[2]*sales_2020_05+n$estimate[3]*sales_2019_05)
June

## [1] 56090.57

3. With model \(y_t=b_1y_{t-12}+ b_4y_{t-48}+b_5insurance + b_6SUVs\)

We use this model to predicting the sales for every month in 2020 since this model only depends on year from previous years which is available to us now.

start <- "2013-01-01"
end <- "2019-12-01"
training_data <- with_trends_full %>% 
  filter(Period >=start& Period <=end)

model1 <- lm(data = training_data , sales~lag(sales, 12)+lag(sales, 48)+ lag(insurance,12) + lag(suvs,12))

# testing data has period up to 2020-12-01, these are place holders for predicted values 
data_for_predict <- sales %>% 
  select(Period) %>% 
  left_join(with_trends_full, by = "Period") %>% 
  filter(Period >="2016-01-01")# 48 month before 2020-01-01

#summary(model1)

predicted <- data_for_predict %>% 
  add_predictions(model1) 

pre <- predicted %>% 
  select(Period,sales,pred) %>%
  mutate(sales= exp(sales), pred=exp(pred)) %>% 
  filter(Period >= "2020-01-01" & Period <="2020-12-01")

kable(pre,align = "ccc", caption = "Prediction for the sales of year 2020")

Prediction for the sales of year 2020
Period	sales	pred
2020-01-01	93268	89911.12
2020-02-01	97859	93609.86
2020-03-01	82677	113541.18
2020-04-01	69889	107428.71
2020-05-01	104684	115232.18
2020-06-01	NA	108674.46
2020-07-01	NA	112311.44
2020-08-01	NA	117267.22
2020-09-01	NA	104070.43
2020-10-01	NA	107700.79
2020-11-01	NA	105292.96
2020-12-01	NA	109684.09

Conclusion

In conclusion, this was a fun project. We used what we have learned in DS3 and applied to this project. This project gives us a chance to understand published paper at a different point of view. Smallest details that authors left out unintentionally can make the replication process much difficult than expected. It’s a valuable lesson for us, the future researchers, to learn. The hardest part of replicating results of published paper is obtaining the data.

Replication and Extensions of the paper Predicting the Present with Google Trends Session 3.1

Xiaona Zhou

Last edited: June 30 2020

Introduction

Replicate results from 3.1 Motor Vehicles and Parts

Replication with authors’ data

Replicate baseline seasonal AR-1 model

Replicate the model with trends

Replicate figure 2

A self-defined function for rolling window forecasting

Compute the numbers for plotting with the function

Plot figure 2

MSE

\(R^2\)

MSE for recession period

Overall Improvement

Recession improvement

Replication with raw data

Where did we find the data?

Comparing sales data used in the paper with data we obtained from census.gov

Attempts to find the transformation on Google Trends data in original data

Replicate the results again with unadjusted data from census.gov with trends data from original paper

Replicate the results with unadjusted data from census.gov and search data from Google Trends without unknown transformations

Clean the data for modeling

Summary of Base model with raw data

Summary of Trends model with raw data (Trends data from previous month)

Summary of Trends model with raw data (Trends data from current month)

Evaluate the model by using rolling window forecast starting from 2005-06-01(with trends data from current month)

Compute the numbers for plotting

Plot figure 2 with raw data

MSE

\(R^2\)

MSE during the recession period

Overall improvement

Improvement during the recession

Extensions

Replication with data between 08/01/2011- 12/01/2019

Clean data

Summary of Base model

Summary of Trends model

Plot Figure 2

Forecasting with new model

Model 1: \(y_{t+2}=b_1y_{t}+b_2y_{t-12} + e_t\)

Clean data

Computing the numbers for plotting

Plot figure 2

MSE

\(R^2\)

Overall Improvement

Plot Sales

Model 2: \(y_t=b_1y_{t-12}+b_4y_{t-48}+b_5insurance + b_6SUVs\)

Model 2.1 \(y_t=b_1y_{t-12}+b_2y_{t-24}+b_3y_{t-36}+b_4y_{t-48}+b_5insurance + b_6SUVs\)

Summary of Base model

Summary of Trends model

Clean data

Plot figure 2

Model 2.2 \(y_t=b_1y_{t-12}+b_4y_{t-48}+b_5insurance + b_6SUVs\)

Summary of Base model

Summary of Trends model

Clean data

Plot figure 2

Predict sales of 2019 using Model 2.2

Predict June and July 2020 with various models

1. With model \(y_t = b_1y_{t-1}+b_12y_{t-12}+e_t\)(AR-1 model)

2. With model \(y_t = b_1y_{t-2}+b_2y_{t-14}+e_t+b_3insurance+b_4SUVs\)

3. With model \(y_t=b_1y_{t-12}+ b_4y_{t-48}+b_5insurance + b_6SUVs\)

Conclusion