Predicting using the author data

This notebook is an expansion on the results from the paper Predicting the Present with Google Trends.. In this notebook, we are trying to predict the future as in predict the data for June 2020 in May 2020. The code can be modified to predict more into the future, as in predict the sales for the whole next year. ## Setup

library(scales)
library(tidyverse)
library(plotly)
library(lubridate)
library(broom)
library(modelr)
library(readxl)
library(zoo)
library(caret)
theme_set(theme_bw())

knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

Load Data

The proper version of CSV files with cleaned data is given below which I recreated for the replication of the results that the authors obtained. We modify it a little based on the description in the paper. This data is extracted from the original sources mentioned by the author in the paper. The format of Google Trends data has since changed and hence there is a very small difference between the authors results and the replicated results.

Sources:

Sales: Census

Trucks & SUVs: Google Trends

Auto Insurance: Google Trends

merged_author <- read_csv("merged_present.csv")

merged_author$sales<-log(merged_author$sales)
merged_author$Period<-as.Date(merged_author$Period,"%m/%d/%Y")
rows <- nrow(merged_author)

Predict Current Month Data

The following part shows the making of models and the prediction of the baseline and trends data. It uses the pipeline format described by the author using K=96.

model_data <- merged_author
model_data <- model_data %>% mutate(lag2 =                                       lag(sales,2),lag12=lag(sales,12),suvs1=lag(suvs,1),insurance1=lag(insurance,1))
model_base <- model_data
model_trends <- model_data

K <- 96

for (i in K:rows){
  model_t <- model_data[1:i-1,]
  model1_t <- lm(data = model_t, sales~lag2+lag12)
  model2_t <- lm(data = model_t, sales~lag2+lag12+ suvs1 + insurance1)
  model_base$sales[i] <- predict(model1_t,model_data[1:i,])[i]
  model_trends$sales[i] <- predict(model2_t,model_data[1:i,])[i]
}
tidy(model1_t)

glance(model1_t)

tidy(model2_t)

glance(model2_t)

Compile Data

This shows the filtering of the proper data for plotting

model_actual <- model_data[K:rows,] %>% mutate(label="actual")
model_base <- model_base[K:rows,] %>% mutate(label="baseline")
model_trends <- model_trends[K:rows,] %>% mutate(label="trends")

Mean Absolute Error

This block shows the calculation of the Mean Absolute Error of the two models. We get an error of 4.80% for baseline model and 5.91% for the trends model.

MAE(model_base$sales,model_actual$sales)

## [1] 0.04814274

MAE(model_trends$sales,model_actual$sales)

## [1] 0.05910185

Selected Period MAE

This block shows the calculation of the Mean Absolute Error for the recession period. Here we get 4.26% using baseline model and 5.42% using trends model. We also show the Rsquare values for each type of the data distribution.

period_start <- "2012-01-01"
period_end <- "2020-01-01"
part_trends_mod <- model_trends %>% filter(Period>=period_start& Period<=period_end)
part_base_mod <- model_base %>% filter(Period>=period_start& Period<=period_end)
part_actual_mod <- model_actual %>% filter(Period>=period_start& Period<=period_end)
MAE(part_base_mod$sales,part_actual_mod$sales)

## [1] 0.04264519

MAE(part_trends_mod$sales,part_actual_mod$sales)

## [1] 0.05425855

rsquare(model1_t,part_actual_mod)

## [1] 0.8677416

cor(part_base_mod$sales,part_actual_mod$sales)^2

## [1] 0.8536572

cor(part_trends_mod$sales,part_actual_mod$sales)^2

## [1] 0.8234873

Plot

This graph shows the replication of the figure shown in the paper, which shows the actual sales, prediction using baseline model and prediction using trends model.

model_plot_data <- rbind(model_actual, model_base, model_trends) %>% filter(Period>=period_start& Period<=period_end)
ggplotly(ggplot(model_plot_data, aes(x=Period, y = sales, color = label, linetype = label))+
  geom_line()+
  scale_colour_manual(values=c("black", "red","grey"))+
  scale_linetype_manual(values = c("solid", "dashed", "solid"))+
  ylab('log(sales)')+
  xlab('period'))

Predicting

Krushang Shah and Rajiv Basnet

6/25/2020