Predicting using the author's data

This notebook replicates results from the paper Predicting the Present with Google Trends.

Setup

library(scales)
library(tidyverse)
library(plotly)
library(lubridate)
library(broom)
library(modelr)
library(readxl)
library(zoo)
library(caret)
theme_set(theme_bw())

knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Load Data

The cleaned CSV file provided by the authors for replicating their results is loaded below. We modify it slightly based on the description in the paper.

merged_author <- read_csv("merged_author.csv")

# Log-transform sales and parse the Period column as dates
merged_author$sales <- log(merged_author$sales)
merged_author$Period <- as.Date(merged_author$Period, "%m/%d/%Y")
rows <- nrow(merged_author)
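Since the exact layout of merged_author.csv is not shown here, a quick sanity check can confirm that the dates parsed and that the columns used below (Period, sales, suvs, insurance) are present; this is a minimal sketch, not part of the original analysis.

# Inspect column types and the date range after parsing
glimpse(merged_author)
range(merged_author$Period)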

Model Summary

The model1_a is the author's version of the baseline model, which predicts this month's sales from last month's sales and the sales from twelve months earlier. For example, to predict the sales of June 2007, it uses the sales of May 2007 and June 2006.

model1_a <- lm(data = merged_author, sales~lag(sales, 1)+lag(sales,12))
tidy(model1_a)
glance(model1_a)

The model2_a is the author's version of the trends model, which adds the current month's Google Trends data to the baseline predictors. For example, to predict the sales of June 2007, it uses the sales of May 2007 and June 2006, along with the Google Trends data (the suvs and insurance series) for June 2007 itself.

model2_a <- lm(data = merged_author, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
tidy(model2_a)
glance(model2_a)
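As a quick in-sample check, the two fits can also be compared directly with an F-test. This is a small sketch that is not part of the paper's out-of-sample procedure; it assumes the suvs and insurance columns have no missing values, so both models are fit on the same rows.

# Nested-model comparison: does adding the Trends regressors improve the in-sample fit?
anova(model1_a, model2_a)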

Predict Current Month Data

The following block builds the models and generates the baseline and trends predictions. It follows the rolling out-of-sample procedure described by the author with K = 17: for each month from the 18th onward, both models are refit on all earlier months and then used to predict the current month.

baseline_a <- merged_author
trends_a <- merged_author
for (i in 18:rows){
  # Refit both models on all months before month i
  merged_t_a <- merged_author[1:(i - 1), ]
  model1_a <- lm(data = merged_t_a, sales ~ lag(sales, 1) + lag(sales, 12))
  model2_a <- lm(data = merged_t_a, sales ~ lag(sales, 1) + lag(sales, 12) + suvs + insurance)
  # Predict on rows 1:i so the lagged regressors for month i are available,
  # then keep only the prediction for month i
  baseline_a$sales[i] <- predict(model1_a, merged_author[1:i, ])[i]
  trends_a$sales[i] <- predict(model2_a, merged_author[1:i, ])[i]
}
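To see why predict() is called on rows 1:i, a single rolling step can be checked by hand. This is a minimal sketch (the objects above are assumed to still be in scope) that refits the baseline model for month 18 and confirms it reproduces the stored prediction.

# Refit on the first 17 months, then predict month 18; predict() needs rows 1:18
# so that lag(sales, 1) and lag(sales, 12) exist for row 18.
i <- 18
fit_check <- lm(sales ~ lag(sales, 1) + lag(sales, 12), data = merged_author[1:(i - 1), ])
pred_check <- predict(fit_check, merged_author[1:i, ])[i]
all.equal(unname(pred_check), baseline_a$sales[i])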

Compile Data

This block filters the data down to the predicted months (18 onward) and labels each series for plotting.

actual_a <- merged_author[18:rows,] %>% mutate(label="actual")
baseline_a <- baseline_a[18:rows,] %>% mutate(label="baseline")
trends_a <- trends_a[18:rows,] %>% mutate(label="trends")

Mean Absolute Error

This block calculates the Mean Absolute Error of the two models; since sales are on a log scale, the MAE is roughly a mean absolute percentage error. We get an error of 6.34% for the baseline model and 5.66% for the trends model, a 10.6% improvement, consistent with the improvement reported in the paper.

MAE(baseline_a$sales,actual_a$sales)
## [1] 0.06343984
MAE(trends_a$sales,actual_a$sales)
## [1] 0.05667658
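MAE() here comes from caret and is simply the mean absolute difference between predictions and observations. A minimal sketch of the same calculation, plus the relative improvement quoted above:

# Mean absolute error computed directly (equivalent to caret::MAE), and the
# relative improvement of the trends model over the baseline
mae_base_a <- mean(abs(baseline_a$sales - actual_a$sales))
mae_trends_a <- mean(abs(trends_a$sales - actual_a$sales))
(mae_base_a - mae_trends_a) / mae_base_a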

Recession MAE

This block calculates the Mean Absolute Error for the recession period (December 2007 through June 2009). Here we get 8.86% using the baseline model and 6.96% using the trends model, a 21.4% improvement.

recession_trends_a <- trends_a %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base_a <- baseline_a %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual_a <- actual_a %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
MAE(recession_base_a$sales,recession_actual_a$sales)
## [1] 0.08869325
MAE(recession_trends_a$sales,recession_actual_a$sales)
## [1] 0.06965812

Plot

This graph replicates the figure from the paper, showing actual sales alongside the baseline and trends model predictions.

author_plot_data <- rbind(actual_a, baseline_a, trends_a)
ggplotly(ggplot(author_plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(sales)')+
  xlab('period'))

Predicting using the extracted data

Load Data

The cleaned CSV file below was recreated by me from the original sources cited in the paper in order to replicate the authors' results; as before, we modify it slightly based on the description in the paper. The format of Google Trends data has changed since publication, so there is a small difference between the authors' results and the replicated results.

Sources:

Sales: Census

Trucks & SUVs: Google Trends

Auto Insurance: Google Trends

merged_extracted <- read_csv("merged_extracted.csv")

# Log-transform sales and parse dates, as before
merged_extracted$sales <- log(merged_extracted$sales)
merged_extracted$Period <- as.Date(merged_extracted$Period, "%m/%d/%Y")
rows <- nrow(merged_extracted)

Model Summary

The model1_e is the baseline model fit on the extracted data; as before, it predicts this month's sales from last month's sales and the sales from twelve months earlier. For example, to predict the sales of June 2007, it uses the sales of May 2007 and June 2006.

model1_e <- lm(data = merged_extracted, sales~lag(sales, 1)+lag(sales,12))
tidy(model1_e)
glance(model1_e)

The model2_e is the trends model fit on the extracted data; it adds the current month's Google Trends data (the suvs and insurance series) to the baseline predictors, so a prediction for June 2007 also uses the Google Trends data from June 2007.

model2_e <- lm(data = merged_extracted, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
tidy(model2_e)
glance(model2_e)

Predict Current Month Data

The following block builds the models and generates the baseline and trends predictions from the extracted data, using the same rolling procedure with K = 17 to replicate the results of the paper.

baseline_e <- merged_extracted
trends_e <- merged_extracted
for (i in 18:rows){
  # Refit both models on all months before month i
  merged_t_e <- merged_extracted[1:(i - 1), ]
  model1_e <- lm(data = merged_t_e, sales ~ lag(sales, 1) + lag(sales, 12))
  model2_e <- lm(data = merged_t_e, sales ~ lag(sales, 1) + lag(sales, 12) + suvs + insurance)
  # Predict on rows 1:i so the lagged regressors for month i are available,
  # then keep only the prediction for month i
  baseline_e$sales[i] <- predict(model1_e, merged_extracted[1:i, ])[i]
  trends_e$sales[i] <- predict(model2_e, merged_extracted[1:i, ])[i]
}

Compile Data

This block filters the data down to the predicted months (18 onward) and labels each series for plotting.

actual_e <- merged_extracted[18:rows,] %>% mutate(label="actual")
baseline_e <- baseline_e[18:rows,] %>% mutate(label="baseline")
trends_e <- trends_e[18:rows,] %>% mutate(label="trends")

Mean Absolute Error

This block calculates the Mean Absolute Error of the two models on the extracted data. We get an error of 6.36% for the baseline model and 6.82% for the trends model, so over the full sample the trends model is actually slightly worse (by roughly 7%) than the baseline here, unlike the author's results; this is most likely due to the changed Google Trends data format noted above.

MAE(baseline_e$sales,actual_e$sales)
## [1] 0.06365923
MAE(trends_e$sales,actual_e$sales)
## [1] 0.0682912
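A minimal sketch of the relative change in MAE on the extracted data; a positive value means the trends model has the larger error:

# Relative change in MAE of the trends model versus the baseline
mae_base_e <- mean(abs(baseline_e$sales - actual_e$sales))
mae_trends_e <- mean(abs(trends_e$sales - actual_e$sales))
(mae_trends_e - mae_base_e) / mae_base_e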

Recession MAE

This block calculates the Mean Absolute Error for the recession period. Here we get 8.90% using the baseline model and 6.82% using the trends model, a 23.3% improvement: during the recession the trends model does improve on the baseline even with the extracted data, consistent with the paper.

recession_trends_e <- trends_e %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base_e <- baseline_e %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual_e <- actual_e %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
MAE(recession_base_e$sales,recession_actual_e$sales)
## [1] 0.08903226
MAE(recession_trends_e$sales,recession_actual_e$sales)
## [1] 0.06821008

Plot

This graph replicates the figure from the paper, showing actual sales alongside the baseline and trends model predictions.

extracted_plot_data <- rbind(actual_e, baseline_e, trends_e)
ggplotly(ggplot(extracted_plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(sales)')+
  xlab('period'))