The proper version of CSV files with cleaned data is given below which I recreated for the replication of the results that the authors obtained. We modify it a little based on the description in the paper. This data is extracted from the original sources mentioned by the author in the paper. The format of Google Trends data has since changed and hence there is a very small difference between the authors results and the replicated results.
Sources:
merged_extracted <- read_csv("merged_extracted.csv")
merged_extracted$sales<-log(merged_extracted$sales)
merged_extracted$Period<-as.Date(merged_extracted$Period,"%m/%d/%Y")
rows <- nrow(merged_extracted)
The model1_e shows the extracted compiled version of the baseline model which uses the values from the past month and also the data from a year before for prediction. Example, to predict the sales of June 2007, it uses the sales of May 2007 and June 2006.
model1_e <- lm(data = merged_extracted, sales~lag(sales, 1)+lag(sales,12))
tidy(model1_e)
glance(model1_e)
The model2_e shows the extracted compiled version of the trends model which uses the values from the past month, the data from a year before and the data from Google Trends for prediction. Example, to predict the sales of June 2007, it uses the sales of May 2007 and June 2006. Along with that, it also uses the Google Trends data from the current month Jun 2007.
model2_e <- lm(data = merged_extracted, sales~lag(sales, 1)+lag(sales,12) + suvs + insurance)
tidy(model2_e)
glance(model2_e)
The following part shows the making of models and the prediction of the baseline and trends data. It uses the pipeline format described by the extracted using K=17 to replicate the results of the paper.
baseline_e <- merged_extracted
trends_e <- merged_extracted
for (i in 18:rows){
merged_t_e <- merged_extracted[1:i-1,]
model1_e <- lm(data = merged_t_e, sales~lag(sales, 1)+lag(sales,12))
model2_e <- lm(data = merged_t_e, sales~lag(sales, 1)+lag(sales,12)+ suvs + insurance)
baseline_e$sales[i] <- predict(model1_e,merged_extracted[1:i,])[i]
trends_e$sales[i] <- predict(model2_e,merged_extracted[1:i,])[i]
}
This shows the filtering of the proper data for plotting
actual_e <- merged_extracted[18:rows,] %>% mutate(label="actual")
baseline_e <- baseline_e[18:rows,] %>% mutate(label="baseline")
trends_e <- trends_e[18:rows,] %>% mutate(label="trends")
This block shows the calculation of the Mean Absolute Error of the two models. We get an error of 6.36% for baseline model and 6.82% for the trends model, which is a 6.74% improvement which is also mentioned in the paper.
MAE(baseline_e$sales,actual_e$sales)
## [1] 0.06365923
MAE(trends_e$sales,actual_e$sales)
## [1] 0.0682912
This block shows the calculation of the Mean Absolute Error for the recession period. Here we get 8.90% using baseline model and 6.82% using trends model, which is 23.3% improvement.
recession_trends_e <- trends_e %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_base_e <- baseline_e %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
recession_actual_e <- actual_e %>% filter(Period>="2007-12-01"& Period<="2009-06-01")
MAE(recession_base_e$sales,recession_actual_e$sales)
## [1] 0.08903226
MAE(recession_trends_e$sales,recession_actual_e$sales)
## [1] 0.06821008
This graph shows the replication of the figure shown in the paper, which shows the actual sales, prediction using baseline model and prediction using trends model.
extracted_plot_data <- rbind(actual_e, baseline_e, trends_e)
ggplotly(ggplot(extracted_plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
geom_line()+
scale_colour_manual(values=c("black", "red","grey"))+
scale_linetype_manual(values = c("solid", "dashed", "solid"))+
ylab('log(sales)')+
xlab('period'))