Background & Approach

As 2019 inches closer, I’ve been thinking about the purpose of my blog. I’d like to use my blog as a forum to practice skills I may not get the opportunity to use otherwise. And who knows, maybe by practicing these skills I will see opportunities to apply them in my current work.

With that, yes, this is another blog post exploring time series. This time, with Bikes! I'm following along with [this blog post](https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials), and the work herein is largely a reproduction of that helpful resource.

The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system in Washington, D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data.

The data appear to be seasonal, with an increase during the summer months, which is understandable given that more people ride bikes in nice weather.

The typical daily count falls between 2,500 and 7,000, but we observe outliers. To work with a cleaner time series, we will use tsclean() from the forecast package on the counts vector to remove outliers and impute any missing values.
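As a minimal sketch of that cleaning step (assuming the daily counts live in a data frame called `daily_data` with a `cnt` column; both names are placeholders):

```r
library(forecast)

# Convert the daily counts to a ts object, then remove outliers
# and impute missing values in one step with tsclean()
count_ts <- ts(daily_data$cnt)
daily_data$clean_cnt <- tsclean(count_ts)
```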

This chart compares the volatility of the raw daily data against its weekly and monthly moving averages.
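The smoothed series can be computed with ma() from the forecast package; a sketch, continuing with the hypothetical `daily_data` from above:

```r
# 7-day (weekly) and 30-day (monthly) centered moving averages
daily_data$cnt_ma   <- ma(daily_data$clean_cnt, order = 7)
daily_data$cnt_ma30 <- ma(daily_data$clean_cnt, order = 30)
```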


The building blocks of a time series analysis are seasonality (\(S_t\)), trend (\(T_t\)), and error (\(E_t\)). These components can be combined in an additive or a multiplicative model:

\[Y_t = S_t + T_t + E_t\]

\[Y_t = S_t \times T_t \times E_t\]

Let’s look at the decomposed time series. Decomposing the time series into different components allows us to decide if we want to remove components.
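A sketch of the decomposition, assuming a 30-day seasonal frequency for the smoothed daily series (the names `count_ma` and `deseasonal_cnt` match the output later in this post):

```r
# Treat the smoothed series as having a 30-day seasonal cycle
count_ma <- ts(na.omit(daily_data$cnt_ma), frequency = 30)

# Decompose into seasonal, trend, and remainder components with loess
decomp <- stl(count_ma, s.window = "periodic")
plot(decomp)

# Strip the seasonal component before fitting a non-seasonal ARIMA
deseasonal_cnt <- seasadj(decomp)
```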

Fitting an ARIMA model requires the series to be stationary. A series is said to be stationary when its mean, variance, and autocovariance are time invariant. Since ARIMA uses previous lags of the series to model its behavior, modeling a stable series with consistent properties involves less uncertainty.

Stationarity is important for forecasting, so we need to test for it. We use the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series is non-stationary.
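A sketch of the test call, using adf.test() from the tseries package (the output follows):

```r
library(tseries)

# Null hypothesis: the series is non-stationary
adf.test(count_ma, alternative = "stationary")
```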

## 
##  Augmented Dickey-Fuller Test
## 
## data:  count_ma
## Dickey-Fuller = -0.2557, Lag order = 8, p-value = 0.99
## alternative hypothesis: stationary

The p-value is 0.99, which is greater than 0.05, so we fail to reject the null hypothesis that the series is non-stationary. Let's help the series out with differencing; the order of differencing (d) will be an input to ARIMA later on.
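A sketch of first-order differencing on the seasonally adjusted series:

```r
# First difference (d = 1) of the seasonally adjusted counts
count_d1 <- diff(deseasonal_cnt, differences = 1)
plot(count_d1)
```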

ACF plots display the correlation between a series and its lags; together with PACF plots, they help suggest the AR and MA orders for the model.
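A sketch of the ACF/PACF plots on the differenced series, followed by a re-run of the ADF test (its output is shown below):

```r
# Inspect the autocorrelation structure of the differenced series
Acf(count_d1, main = "ACF of differenced series")
Pacf(count_d1, main = "PACF of differenced series")

# Re-test stationarity after differencing
adf.test(count_d1, alternative = "stationary")
```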

## 
##  Augmented Dickey-Fuller Test
## 
## data:  count_d1
## Dickey-Fuller = -9.9255, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

The p-value is now less than 0.05, so we reject the null hypothesis that the series is non-stationary and conclude the differenced series is stationary.

Fitting ARIMA

We will use BIC to evaluate the model (lower is better).
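A sketch of the baseline fit, letting auto.arima() search for a non-seasonal specification (its output follows):

```r
# Fit a non-seasonal ARIMA to the seasonally adjusted counts
fit <- auto.arima(deseasonal_cnt, seasonal = FALSE)
summary(fit)
```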

## Series: deseasonal_cnt 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.5510  -0.2496
## s.e.  0.0751   0.0849
## 
## sigma^2 estimated as 26180:  log likelihood=-4708.91
## AIC=9423.82   AICc=9423.85   BIC=9437.57

Using the fitted coefficients, the model (on the differenced series) can be written as: \[\hat{Y}_{d_t} = 0.551 \, Y_{t-1} - 0.2496 \, e_{t-1} + E\]

To evaluate the model, we want to look at the residuals.
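One convenient view is tsdisplay() from the forecast package, which plots the residuals alongside their ACF and PACF; a sketch:

```r
# Residual diagnostics: time plot plus ACF and PACF
tsdisplay(residuals(fit), lag.max = 45, main = "ARIMA(1,1,1) residuals")
```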

Residuals

There is something going on at lag 7 based on the PACF chart. Let’s try to specify the ARIMA explicitly.
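A sketch of the manual specification, raising the MA order to 7 to absorb the lag-7 autocorrelation (the fitted coefficients are shown below):

```r
# Manually specify ARIMA(p = 1, d = 1, q = 7)
fit2 <- arima(deseasonal_cnt, order = c(1, 1, 7))
fit2
```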

## 
## Call:
## arima(x = deseasonal_cnt, order = c(1, 1, 7))
## 
## Coefficients:
##          ar1     ma1     ma2     ma3     ma4     ma5     ma6      ma7
##       0.2803  0.1465  0.1524  0.1263  0.1225  0.1291  0.1471  -0.8353
## s.e.  0.0478  0.0289  0.0266  0.0261  0.0263  0.0257  0.0265   0.0285
## 
## sigma^2 estimated as 14392:  log likelihood = -4503.28,  aic = 9024.56

The AIC for the second model, with MA order 7, is lower (9024.56 vs. 9423.82), so it fits better than the auto-ARIMA choice.

Let’s forecast

Let’s forecast out the next 30 days using our second model.
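A sketch of the 30-day forecast from the manual model:

```r
# Forecast 30 periods ahead and plot with prediction intervals
fcast <- forecast(fit2, h = 30)
plot(fcast)
```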

To see how our model performs against actuals, we can hold out a group of observations and then forecast that period.
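A sketch of that hold-out evaluation, assuming we reserve roughly the last month of observations (the index cutoff of 700 is an assumption for illustration):

```r
# Hold out observations from index 700 onward
hold <- window(ts(deseasonal_cnt), start = 700)

# Refit on the training portion only, then forecast the held-out window
fit_no_holdout <- arima(ts(deseasonal_cnt[-c(700:725)]), order = c(1, 1, 7))
fcast_no_holdout <- forecast(fit_no_holdout, h = 25)

# Overlay the actual series on the forecast for comparison
plot(fcast_no_holdout, main = "")
lines(ts(deseasonal_cnt))
```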

Let's add seasonality back in and evaluate the fit. We specify the model with auto.arima(), this time allowing a seasonal component, then call forecast() with the fitted model object and the number of periods to forecast.
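A sketch of the seasonal fit and forecast (the resulting model is shown below):

```r
# Allow a seasonal component in the model search
fit_w_seasonality <- auto.arima(deseasonal_cnt, seasonal = TRUE)

# Forecast 30 periods ahead with the seasonal model
seas_fcast <- forecast(fit_w_seasonality, h = 30)
plot(seas_fcast)
```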

## Series: deseasonal_cnt 
## ARIMA(2,1,2)(1,0,0)[30] 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2    sar1
##       1.3644  -0.8027  -1.2903  0.9146  0.0100
## s.e.  0.0372   0.0347   0.0255  0.0202  0.0388
## 
## sigma^2 estimated as 24810:  log likelihood=-4688.59
## AIC=9389.17   AICc=9389.29   BIC=9416.68

For a side-by-side BIC comparison of the auto-ARIMA model, the manual ARIMA(1,1,7), and the seasonal model, respectively:

## [1] 9437.571
## [1] 9065.827
## [1] 9416.681

Conclusion / What have I learned:

What is true for other modeling appears to hold true for time-series forecasting: it is an iterative approach that involves building a simple model (auto-ARIMA), examining the residuals and a model-evaluation metric of choice (e.g., AIC or BIC), and refining the specification until the diagnostics improve.

Packages used: forecast, tseries, ggplot2, ggfortify

The general steps I followed were:

1. Load the data and clean outliers and missing values with tsclean().
2. Smooth the series with moving averages and decompose it to inspect seasonality and trend.
3. Test for stationarity with the ADF test, and difference the series as needed.
4. Fit a baseline model with auto.arima() and examine the residuals via ACF/PACF.
5. Refine the model specification manually and compare AIC/BIC.
6. Forecast, including against a hold-out period, to evaluate performance.