In this blog post, I will show you how you can quickly and easily forecast a univariate time series.I am going to use data from the EU Open Data Portal on air passenger transport. You can find thedata here. I downloadedthe data in the TSV format for Luxembourg Airport, but you could repeat the analysis for any airport.
Once you have the data, load some of the package we are going to need:
1 |
|
and define the following function:
1 |
|
This function, the inverse hyperbolic sine, is useful to transform data in a manner that is veryclose to logging it, but that allows for 0’s. The data from Eurostat is not complete for some reason,so there are some 0 sometimes. To avoid having to log 0, which in R yields -Inf
, I use thistransformation.
Now, let’s load the data:
1 |
|
1 |
|
1 |
|
Let’s take a look at the data:
1 |
|
1 |
|
So yeah, useless in that state. The first column actually is composed of 3 columns, merged together,and instead of having one column with the date, and another with the value, we have one columnper date. Some cleaning is necessary before using this data.
Let’s start with going from a wide to a long data set:
1 |
|
The first line makes it possible to only select the columns that contain the string “20”, soselecting columns from 2000 onward. Then, using gather, I go from long to wide. The data lookslike this now:
1 |
|
Now, let’s separate the first column into 3 columns:
1 |
|
This separates the first column into 3 new columns, “unit”, “tra_meas” and “air_pr\time”. This stepis not necessary for the rest of the analysis, but might as well do it. The data looks like this now:
1 |
|
The next steps are simple renamings. I have copy-pasted the information from the Eurostat pagewhere you can view the data online.If you click here:
you will be able to select the variables you want displayed in the table, as well as the dictionaryof the variables. I simply copy pasted it and recoded the variables. You can take a look at thewhole cleaning workflow by clicking “Click to expand” below:
Click here to take a look at the whole cleaning workflow
1 |
|
1 |
|
There is quarterly data and monthly data. Let’s separate the two:
1 |
|
In the “date” column, I detect the observations with “Q” in their name, indicating that it is quarterly data.I do the same for monthly data, but I have to add the string “01” to the dates. This transformsa date that looks like this “2018M1” to this “2018M101”. “2018M101” can then be converted into adate by using the ymd()
function from lubridate. yq()
was used for the quarterly data.
1 |
|
Time for some plots. Let’s start with the raw data:
1 |
|
And now with the logged data (or rather, the data transformed using the inverted hyperbolic sinetransformation):
1 |
|
We clearly see a seasonal pattern in the data. There is also an upward trend. We will have to dealwith these two problems if we want to do some forecasting. For this, let’s limit ourselves to datafrom before 2015, and convert the “passengers” column from the data to a time series object, usingthe ts()
function:
1 |
|
We will try to pseudo-forecast the data from 2015 to the last point available, March 2018.First, let’s tranform the data:
1 |
|
Taking the log, or ihs of the data deals with stabilizing the variance of the time series.
There might also be a need to difference the data. Computing the differences between consecutiveobservations makes the time-series stationary. This will be taken care of by the auto.arima()
function, if needed. The auto.arima()
function returns the best ARIMA model according to differentstatistical criterions, sach us the AIC, AICc or BIC.
1 |
|
1 |
|
auto.arima()
found that the best model would be an (ARIMA(2, 1, 1)(2, 1, 0)_{12}). This is anseasonal autoregressive model, with p = 2, d = 1, q = 1, P = 2 and D = 1.
1 |
|
I can now forecast the model for the next 39 months (which correspond to the data available).
To plot the forecast, one could do a simple call to the plot function. But the resulting plotis not very aesthetic. To plot my own, I have to grab the data that was forecast, and do somemunging again:
1 |
|
as_tibble()
is a function from the {tsibble}
package that converts objects that are time-series awareto time-aware tibbles. If you are not familiar with ts_tibble()
, I urge you to run the above linesone by one, and especially to compare as_tsibble()
with the standard as_tibble()
from the {tibble}
package.
This is how estimated_data
looks:
1 |
|
1 |
|
We can now plot the data, with the forecast, and with the 95% confidence interval:
1 |
|
The pseudo-forecast (the dashed line) is not very far from the truth, only overestimating theseasonal peaks, but the true line is within the 95% confidence interval, which is good!
If you found this blog post useful, you might want to follow me on twitter for blog post updates.
Related