How do I make and use an ARIMA model?¶
With skits
, of course :)
But before we get there, remember how I said that we should be careful to make sure our data is stationary? Let me detail this point, first.
What’s Stationarity Got To Do With It?¶
The Wikipedia definition of a stationary process is “a stochastic process whose unconditional joint probability distribution does not change when shifted in time”. I’m not sure of the general usefulness of that definition, but their second comment is more accessible: “…parameters such as mean and variance, if they are present, also do not change over time.” That’s more palatable.
Now here’s the thing about stationarity: I’ve found it incredibly confusing to research this online (Author’s note: This is a sign that I should probably just read a damn textbook). Here’s the mess of assertions that I have come across:

The data must be stationary before it is fed into an ARIMA model.

The residuals must be stationary after they are calculated from the ARIMA model.

If the data is not stationary then difference it until it becomes stationary.

If the data is not stationary then try adding in a Trend term.
While much of the above feels contradictory, it’s all fairly correct. The following walks through how I understand this, but I would be happy for feedback on where I’m wrong:
One would like to feed stationary data into a linear model because this ensures that the model will not suffer from multicollinearity such that individual predictions worsen and interpretability is reduced. One way to (try to) stationarize the data is to difference it. One typically differences the data by hand in order to determine how many times one should difference the data to make it stationary. With this knowledge in hand, one then passes the undifferenced data into the ARIMA model. The ARIMA model containes a differencing step. Differencing by hand is performed to determine the differencing order paramater (like an ML hyperparameter!) for the ARIMA model. This is all part of the BoxJenkins method for building ARIMA models.
Another way to stationarize data is to add a trend term to a model, and we decide on differencing vs. trend terms depending on whether there is a stochastic or deterministic (pdf) trend in the data, respectively.
If the residuals are stationary after being fed through a linear model, then the GaussMarkov theorem guarantees us that we have found the best unbiased linear estimator (BLUE) of the data. Another way to think about this is that, if we see that the residuals are not stationary, then there is probably some pattern in the data that we should be able to incorporate into our model such that the residuals become stationary. There’s also a ton of bonus goodies that we get if we can follow GaussMarkov, such as accurate estimates of uncertainty. This leads to a big danger in that we may underestimate the uncertainty of our model (and consequently overestimate correlations) if we use GaussMarkov theorems on nonstationary data.
I got very hung up on determining whether the data entering the model must be stationary or the residuals that result from the model must be stationary because lots of literature focuses on stationarizing the data but the theory relies on stationary residuals. Stationarizing the data is a unique aspect of ARIMA models  it helps to determine the parameters of the ARIMA model due to some theoretical things that pop out, but it is perfectly fine to feed nonstationary data into a model and hope that the model produces stationary residuals (or not even produce stationary residuals if you’re happy to give up GaussMarkov).
One could make the argument
Who cares about the antiquated methods for determining the parameters of my ARIMA model? I’ll just figure them out with grid search and cross validation.
and I would half agree.
The devil on my shoulder cajoles
Yes! Ain’t nobody got time for learning old school statistics. Compute is cheap  go for it.
but the angel cautions
Well, you can do that, but it may be hard to find suitable parameters and you should be aware of what you’re sacrificing in the process.
See, a lot of the time series theory was developed and used in econometrics. In that field, it is particularly important to understand the “data generating process”, and one would like to be able to make decisions in the face of new data, new patterns, and new forecasts. There are ARIMA models that represent data generating processes that would be unrealistic in economic scenarios, and one would want to know this when developing a model!
What I would like to do is trade some of the stability and interpretability of classical time series modeling for the potentially higher accuracy and easier implementation (AKA I don’t have to know the stats or solve nonlinear PDEs) of supervised machine learning. After all, from my original most, I just want to know how many Citi Bikes will be occupying a station in the next 15 minutes. This is not mission critical!
So actually we can’t build ARIMA models with skits :(
But, we can build parts of them! Recall that the moving average terms make the problem such that we cannot write it in our nice design matrix form of $\hat{y_{t}} = f(\mathbf{X}_{t})$. So, we’ll stick with the integrated and autoregressive terms, for now.
To start, let’s import some standard libraries and load the old Citi Bike data. Recall that the data consists of the number of bikes available at a station as a function of time. Just by eye, we can see that there is some daily and weekly periodicity.