Category
- Advanced Modeling
Tags
- Data Manipulation
- Data Visualisation
- ggplot2
- R Programming
The motivation for choosing this data set to explore is straightforward: I wanted a real-world data set for which additional metadata would be easy to find, making NYC data perfect; I wanted something that did not require specialised domain knowledge, so buses were a good choice; and I wanted a data set that might produce some interesting insights. As an additional bonus, this data set had very few downloads or kernels on Kaggle, so it seemed like mostly untrodden ground.
Data are available from the NYC open data site. Linked is the main data page, which contains the bus breakdown ID number, as well as the route number, the schools serviced, the borough through which the bus travels, and so on.
The readme file included in this bundle also contains references to linking information: drivers and attendants, routes, transportation sites, vehicles, pre-k riders by transportation site, routes by transportation site, and pre-k vendors by transportation site. All of this data can be combined (see later) to attach route- and vendor-level information to the main delay data set: for example, to quantify the total number of drivers employed by each company, the total number of students or schools that they service, and so on.
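As a rough sketch of the kind of join this involves (the data frame and column names below, such as `drivers_attendants` and `Bus_Company_Name`, are placeholders of mine rather than the published schema names):

```r
library(dplyr)

# Hypothetical sketch: aggregate one of the vendor-level files and join the
# summary onto the main breakdown/delay data. Names are placeholders.
drivers_per_vendor <- drivers_attendants %>%
  group_by(vendor_name) %>%
  summarise(n_drivers = n_distinct(driver_id))

delays <- delays %>%
  left_join(drivers_per_vendor, by = c("Bus_Company_Name" = "vendor_name"))
```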
Our goal is to make a useful prediction of how long a bus is going to be delayed, at the time the delay is called in to the operations centre. So we take a look at our primary ‘ii_’ data set.
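A minimal ggplot2 sketch of this first look at the delay distribution, assuming the delay has already been cleaned into a numeric `delay_minutes` column alongside the `Reason` field (both column names are my assumptions):

```r
library(ggplot2)

# Sketch: counts of recorded delay times, coloured by the stated reason.
# The `delays` data frame and its column names are assumptions.
ggplot(delays, aes(x = delay_minutes, fill = Reason)) +
  geom_bar() +
  coord_cartesian(xlim = c(0, 100)) +  # a few delays run beyond 100 minutes
  labs(x = "Delay (minutes)", y = "Count")
```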
A bar graph broken up by Reason shows that, as observed when cleaning up the delay data, these are ‘human-approximated’ delays. The driver or attendant (or perhaps someone at operations) has rounded the delay to the nearest convenient increment: 5 minutes, 10 minutes, 15 minutes, 20, 25, 30, 40, 45, 50, 60, 90 minutes, and so on. Some delays have been worked out more accurately by comparing time differences, but most have not. One consequence is that if we decide to run a regression model on this data, we can never expect it to be more precise than 5-10 minutes of RMS error, because that is how the data have been recorded.
Note that there are a few observations with delays beyond 100 minutes, and these points will show up later; the graph is presented this way for clarity. Traffic constitutes the major source of delay in the data set, while accidents are the rarest. ‘Other’, meaning a missing or undisclosed reason, is the second most common value in this field.
We can have a look at some of the structure via density plots:
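Something along these lines would do it (again, the `delay_minutes`, `Reason`, and `Boro` column names are my assumptions rather than the original code):

```r
library(ggplot2)

# Sketch: density of delay times, split by reason and then by borough.
ggplot(delays, aes(x = delay_minutes, colour = Reason)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 100))

ggplot(delays, aes(x = delay_minutes, colour = Boro)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 100))
```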
Here we must be careful with interpretation, since the density plot presents counts convolved with a Gaussian kernel. This helps us to see structure, but the smoothed curves should not be read as raw counts: there are still very few observations with a delay time of exactly 22 minutes, for example.
But we can see some structure a little more clearly. For example, Manhattan produces more long delays than the Bronx or Staten Island.
In the next set of notes we will try to fit a regression based only on this data. However, we can already see the need to join on more data, since we have few predictors. Before building a model or joining to the other data sets, there is a little more cleaning and dummifying of categorical variables to do.
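A rough sketch of what that dummifying step could look like with base R's model.matrix(), where the column names `delay_minutes`, `Reason`, and `Boro` are my assumptions rather than the names used in the original code:

```r
# Convert the categorical predictors to factors and expand them into
# 0/1 dummy columns. Column names are assumptions.
delays$Reason <- as.factor(delays$Reason)
delays$Boro   <- as.factor(delays$Boro)

# model.matrix() creates indicator columns, dropping the first level of each
# factor as the reference category.
X <- model.matrix(delay_minutes ~ Reason + Boro, data = delays)
head(X)
```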