This notebook is an expansion on the results from the paper Predicting the Present with Google Trends.. In this notebook, we are trying to predict the future as in predict the data for June 2020 in May 2020. The code can be modified to predict more into the future, as in predict the sales for the whole next year. ## Setup
library(scales)
library(tidyverse)
library(plotly)
library(lubridate)
library(broom)
library(modelr)
library(readxl)
library(zoo)
library(caret)
theme_set(theme_bw())
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
The proper version of CSV files with cleaned data is given below which I recreated for the replication of the results that the authors obtained. We modify it a little based on the description in the paper. This data is extracted from the original sources mentioned by the author in the paper. The format of Google Trends data has since changed and hence there is a very small difference between the authors results and the replicated results.
Sources:
merged_author <- read_csv("merged_present.csv")
merged_author$sales<-log(merged_author$sales)
merged_author$Period<-as.Date(merged_author$Period,"%m/%d/%Y")
rows <- nrow(merged_author)
The following part shows the making of models and the prediction of the baseline and trends data. It uses the pipeline format described by the author using K=96.
model_data <- merged_author
model_data <- model_data %>% mutate(lag2 = lag(sales,2),lag12=lag(sales,12),suvs1=lag(suvs,1),insurance1=lag(insurance,1))
model_base <- model_data
model_trends <- model_data
K <- 96
for (i in K:rows){
model_t <- model_data[1:i-1,]
model1_t <- lm(data = model_t, sales~lag2+lag12)
model2_t <- lm(data = model_t, sales~lag2+lag12+ suvs1 + insurance1)
model_base$sales[i] <- predict(model1_t,model_data[1:i,])[i]
model_trends$sales[i] <- predict(model2_t,model_data[1:i,])[i]
}
tidy(model1_t)
glance(model1_t)
tidy(model2_t)
glance(model2_t)
This shows the filtering of the proper data for plotting
model_actual <- model_data[K:rows,] %>% mutate(label="actual")
model_base <- model_base[K:rows,] %>% mutate(label="baseline")
model_trends <- model_trends[K:rows,] %>% mutate(label="trends")
This block shows the calculation of the Mean Absolute Error of the two models. We get an error of 4.80% for baseline model and 5.91% for the trends model.
MAE(model_base$sales,model_actual$sales)
## [1] 0.04814274
MAE(model_trends$sales,model_actual$sales)
## [1] 0.05910185
This block shows the calculation of the Mean Absolute Error for the recession period. Here we get 4.26% using baseline model and 5.42% using trends model. We also show the Rsquare values for each type of the data distribution.
period_start <- "2012-01-01"
period_end <- "2020-01-01"
part_trends_mod <- model_trends %>% filter(Period>=period_start& Period<=period_end)
part_base_mod <- model_base %>% filter(Period>=period_start& Period<=period_end)
part_actual_mod <- model_actual %>% filter(Period>=period_start& Period<=period_end)
MAE(part_base_mod$sales,part_actual_mod$sales)
## [1] 0.04264519
MAE(part_trends_mod$sales,part_actual_mod$sales)
## [1] 0.05425855
rsquare(model1_t,part_actual_mod)
## [1] 0.8677416
cor(part_base_mod$sales,part_actual_mod$sales)^2
## [1] 0.8536572
cor(part_trends_mod$sales,part_actual_mod$sales)^2
## [1] 0.8234873
This graph shows the replication of the figure shown in the paper, which shows the actual sales, prediction using baseline model and prediction using trends model.
model_plot_data <- rbind(model_actual, model_base, model_trends) %>% filter(Period>=period_start& Period<=period_end)
ggplotly(ggplot(model_plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(sales)')+
xlab('period'))