Time Series for scikit-learn People (Part I): Where's the X Matrix?

When I first started to learn about machine learning, specifically supervised learning, I eventually felt comfortable with taking some input $\mathbf{X}$, and determining a function $f(\mathbf{X})$ that best maps $\mathbf{X}$ to some known output value $y$. Separately, I dove a little into time series analysis and thought of this as a completely different paradigm. In time series, we don’t think of things in terms of features or inputs; rather, we have the time series $y$, and $y$ alone, and we look at previous values of $y$ to predict future values of $y$.

Still, I often wondered, “Where does machine learning end and time series begin?” Or, “How do I use ‘features’ in time series?” I guess if you can answer these questions really well, then you’re poised to make some money in finance. Or not. I have no idea - I really know nothing about that world.

I have had trouble answering these questions, in no small part due to the common difficulty of dealing with domain-specific vocabulary. I find myself having to internally translate time series concepts into “familiar” machine learning concepts. With this in mind, I would thus like to kick off a series of blog posts about analyzing time series data, with the hope of presenting these concepts in a familiar form. Due to the ubiquity of scikit-learn, I’ll assume that the scikit-learn API constitutes a familiar form.

To start the series off, in this post I’ll introduce a time series dataset that I’ve gathered. I’ll then walk through how we can turn the time series forecasting problem into a classic linear regression problem.
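As a preview of where we’re headed, here is a minimal sketch of the idea using a made-up series rather than the real data: lagged copies of the series become the columns of $\mathbf{X}$, and the value that follows them becomes the target, at which point any ordinary scikit-learn regressor applies.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for a real time series (e.g. bike availability over time).
y = pd.Series(np.random.randint(0, 40, size=200))

# Lagged copies of the series become the feature matrix X:
# row t holds y[t-1], y[t-2], y[t-3], and the target is y[t].
n_lags = 3
features = pd.concat(
    {f'lag_{i}': y.shift(i) for i in range(1, n_lags + 1)}, axis=1
)
data = features.assign(target=y).dropna()

X = data.drop(columns='target').values
target = data['target'].values

model = LinearRegression().fit(X, target)
```

The `dropna()` is there because the first few rows have no full set of lagged values to draw on.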

The requirements for a suitable time series dataset are fairly minimal: we need some quantity that changes with time. Ideally, the dataset would also exhibit some patterns so that we can learn about things like seasonality and cyclic behavior. Thankfully, I’ve got just the dataset!

This is where you groan because I say that the dataset is related to the Citi Bike NYC bikeshare data. Everybody writes about this damn dataset, and it’s been beaten to death. Well, I agree, but hear me out. I’m not using the same dataset that everybody else uses. The typical Citi Bike dataset consists of all trips taken on the bikes. My dataset looks at the number of bikes at each station as a function of time. Eh?

I started collecting this data about a year and a half ago because I was dealing with a common, frustrating scenario. I would check the Citi Bike app to make sure that there would be docks available at the station by my office before I left my apartment in the morning. However, by the time I rode to the office station, all the docks would have filled up, and I’d then have to go searching for a different station at which to dock the bike. Ideally, my app should tell me that, even though there are 5 docks available right now, they will likely be unavailable by the time I get there.

I wanted to collect this data and predict this myself. I still haven’t done that, but hopefully we will later on in this blog series. In the meantime, since 9/18/2016, I have pinged the Citi Bike API every 2 minutes to collect how many bikes and docks are available at every single Citi Bike station in NYC. In the spirit of not planning ahead, this all just runs on a cron job on a t2.micro EC2 instance backed by a Postgres database that is running locally on the same instance. There are some gaps in the data due to me occasionally running out of hard drive space on the instance. The code for this lives in this repo, and I’ve stored more than 200 million records.
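For a sense of what that cron job does, here is a rough sketch of a polling script. The GBFS feed URL, database name, and table schema are illustrative assumptions, not the actual code from the repo.

```python
import requests
import psycopg2

# Assumed GBFS station status feed; the real job may hit a different endpoint.
FEED_URL = 'https://gbfs.citibikenyc.com/gbfs/en/station_status.json'


def poll_once():
    """Fetch the current status of every station and append it to Postgres."""
    stations = requests.get(FEED_URL, timeout=30).json()['data']['stations']
    conn = psycopg2.connect(dbname='citibike')  # hypothetical local database
    with conn, conn.cursor() as cur:
        for s in stations:
            cur.execute(
                'INSERT INTO station_status '
                '(station_id, bikes_available, docks_available, last_reported) '
                'VALUES (%s, %s, %s, to_timestamp(%s))',
                (s['station_id'], s['num_bikes_available'],
                 s['num_docks_available'], s['last_reported']),
            )
    conn.close()


if __name__ == '__main__':
    poll_once()
```

Run from cron every 2 minutes (e.g. `*/2 * * * * python poll_citibike.py`), a script like this accumulates one row per station per poll, which is how the row count climbs into the hundreds of millions.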

For more information about this dataset, I wrote a brief post on Making Dia.

For our purposes today, I am going to focus on a single time series from this data. The time series consists of the number of available bikes at the station at East 16th St and 5th Ave (i.e. the closest one to my apartment) as a function of time. Specifically, time is indexed by the last_communication_time. The Citi Bike API seems to update its values with random periodicity for different stations. The last_communication_time corresponds to the last time that the Citi Bike API talked to the station at the time of my query.

We’ll start by reading the data in with pandas. Pandas is probably the preferred library for exploring time series data in Python. It was originally built for analyzing financial data, which is why it shines so well for time series. For an excellent resource on time series modeling in pandas, check out Tom Augspurger’s post in his Modern Pandas series. While I found that post to be extremely helpful, I am more interested in why one does certain things with time series as opposed to how to do these things.

My bike availability time series is in the form of a pandas Series object and is stored as a pickle file. Often, one does not care about the order of the index in pandas objects, but, for time series, you will want to sort the values in chronological order. Note that I make sure the index is a sorted pandas DatetimeIndex, as shown below.
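Something along the following lines does the trick; the filename is a placeholder rather than the actual path from the repo.

```python
import pandas as pd

# Load the availability series (the pickle filename here is a placeholder).
ts = pd.read_pickle('available_bikes.pkl')

# Time series work assumes chronological order, so sort the index
# and confirm that it really is a DatetimeIndex.
ts = ts.sort_index()
assert isinstance(ts.index, pd.DatetimeIndex)

ts.head()
```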