Visualization of NYC bus delays with R

  1. Advanced Modeling

Tags

  • Data Manipulation
  • Data Visualisation
  • ggplot2
  • R Programming

The motivation for choosing this data set to explore is straightforward: I wanted a real-world data set for which additional meta data would be easy to find, making NYC data perfect; I wanted something which did not require specialised domain knowledge, and so buses are a good choice; and I wanted a data set that might produce some interesting insights. As an additional bonus, this data set had very few downloads or kernels on Kaggle, so it seemed like mostly untrodden ground.

Data are available from the NYC open data site. Linked is the main data page, which contains the bus breakdown ID number, as well as the route number, the schools serviced, the borough through which the bus travels, and so on.

The readme file included in this bundle also contains references to linking information. Specifically, the drivers and attendants, routes, transportation sites, vehicles, pre-k riders by transportation site, routes by transportation site, and pre-k vendors by transportation site. All of this data can be combined (see later) to provide route- and vendor- level data to the main delay data set. For example, to quanitfy the total number of drivers employed by each company, the total number of students or schools that they service, and so on.

Our goal is to be able to make some kind of useful prediction about how long a bus is going to be delayed, at the time its delay is called into the operations centre. So we take a look at our primary ‘ii_’ data set.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
in_csv <- "../data/II_Bus_Breakdown_and_Delays.csv"
ii_breakdowns <- read_csv(in_csv)
ii_breakdowns %>% glimpse
## Observations: 259,637
## Variables: 21
## $ School_Year "2015-2016", "2015-2016", "201...
## $ Busbreakdown_ID 1212699, 1212700, 1212701, 121...
## $ Run_Type "Special Ed AM Run", "Special ...
## $ Bus_No "48186", "2518", "235", "2102"...
## $ Route_Number "N758", "L530", "K168", "K216"...
## $ Reason "Other", "Mechanical Problem",...
## $ Schools_Serviced "75485", "21854", "18366", "21...
## $ Occurred_On 2015-09-02 06:27:00, 2015-09-...
## $ Created_On 2015-09-02 06:29:00, 2015-09-...
## $ Boro "Nassau County", "Brooklyn", "...
## $ Bus_Company_Name "BORO TRANSIT, INC.", "RELIANT...
## $ How_Long_Delayed "25 minutes", NA, "30MINS", "2...
## $ Number_Of_Students_On_The_Bus 0, 0, 0, 1, 0, 0, 0, 9, 0, 2, ...
## $ Has_Contractor_Notified_Schools "Yes", "Yes", "Yes", "Yes", "Y...
## $ Has_Contractor_Notified_Parents "No", "Yes", "Yes", "Yes", "Ye...
## $ Have_You_Alerted_OPT "No", "Yes", "No", "No", "No",...
## $ Informed_On 2015-09-02 06:29:00, 2015-09-...
## $ Incident_Number NA, NA, NA, NA, NA, NA, NA, NA...
## $ Last_Updated_On 2015-09-02 06:29:16, 2015-09-...
## $ Breakdown_or_Running_Late "Running Late", "Breakdown", "...
## $ School_Age_or_PreK "School-Age", "School-Age", "S...

`ii_breakdowns %>% distinct(How_Long_Delayed) %>% head(n=20)`
## # A tibble: 20 x 1
## How_Long_Delayed
## 
## 1 25 minutes 
## 2 
## 3 30MINS 
## 4 20 min 
## 5 30 min 
## 6 90 MIN 
## 7 20 MINS 
## 8 20 
## 9 30 mins 
## 10 15 MINS 
## 11 20 MIN 
## 12 15 mins 
## 13 30 MINS 
## 14 25 
## 15 0 
## 16 10 mins 
## 17 25 min 
## 18 1hr 
## 19 20 mins 
## 20 45 mins

There is an obvious problem: these delays are 'human recorded', meaning that the format is inconsistent. Sometimes we have minutes, min, mts, hours, hrs; sometimes just two numbers separated by ':', and so on. Additionally, the times are not accurately recorded, as we will verify in a moment.
Wrangling the delay times into something machine readable takes some doing. I don't doubt that there is a more general path available, but I went through sequentially, case by case, as generally as I could; then cleaned up the rest at the end.
#--sequential clean
ii_times <- ii_breakdowns %>%
 mutate(units_delayed = str_extract(How_Long_Delayed, "[:alpha:]+"),
 time_delayed = as.numeric(str_extract(How_Long_Delayed, "[:digit:]+")),
 units_delayed = ifelse(str_detect(units_delayed, "^[mM]"), "minutes", "other"),
 units_delayed = ifelse(is.na(units_delayed), "other", units_delayed)
 ) %>%
 #remove "in" case, exclude any flagged with "[hH}"
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "in") & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
 ) %>%
 #deal with "to" case with minutes by taking only the first number. There are apparently no 'to' strings with hours in them.
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "to") & str_detect(How_Long_Delayed, "[mM]"),"minutes", units_delayed)
 ) %>%
 #deal with the - cases which don't involve hours at all
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
 ) %>%
 #deal with the double digit minute-1hr:
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & str_detect(How_Long_Delayed, "^[1-5][0-9]"),"minutes", units_delayed)
 ) %>%
 #everything left with a hyphen is a 1 hour or more delay
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), "minutes", units_delayed)
 ) %>%
 #-all hyphens dealt with at this point
 #Now work on the "/": first take the minute timers: all those with double digits first
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "/") & str_detect(How_Long_Delayed, "^[1-9][0-9]"), "minutes", units_delayed)
 ) %>%
 #All those starting with 1/2 go to 30 minutes
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), 30, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), "minutes", units_delayed)
 ) %>%
 #All those starting with 1 1/2 go to 90 minutes
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), "minutes", units_delayed)
 ) %>%
 #---specific cases
 #"1 HOUR1/2"
 mutate(
 time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", "minutes", units_delayed)
 ) %>%
 #"@least 1/2"
 mutate(
 time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", 30, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", "minutes", units_delayed)
 ) %>%
 #"1HR/20MIN"
 mutate(
 time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", 80, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", "minutes", units_delayed)
 )%>%
 #--end specific cases
 #1hr/x case
 #This is modifying things it shouldn't. Why is this even here?
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), "minutes", units_delayed)
 )%>%
 #---cases with ":"
 #remove cases which erroneously report a time instead of a duration, since these are not reliable
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, ":") & str_detect(How_Long_Delayed, "[aA][mM]"), NA, time_delayed)
 )%>%
 #2:hr cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), 120, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), "minutes", units_delayed)
 )%>%
 #1:45 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), 105, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), "minutes", units_delayed)
 )%>%
 #1:30 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), "minutes", units_delayed)
 )%>%
 #1: cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), "minutes", units_delayed)
 )%>%
 #--end cases with ":"
 #--cases with "^2"
 #2 hr cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), 120, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
 )%>%
 #double digit minute cases:
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2[0-9]"), "minutes", units_delayed) #times are already correct
 )%>%
 #---"^1"
 #1 hour cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), "minutes", units_delayed)
 )%>%
 #1.5 hour cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), "minutes", units_delayed)
 )%>%
 #NR: some of the ^1 x min cases going in here are already tagged as minutes.
 #1 hr 30 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), "minutes", units_delayed)
 )%>%
 #1 hr 15 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), 75, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), "minutes", units_delayed)
 )%>%
 #1 hr 20 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), 80, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), "minutes", units_delayed)
 )%>%
 #1 hr 45 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), 105, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), "minutes", units_delayed)
 )%>%
 #1 hr half
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), "minutes", units_delayed)
 )%>%
 #1 h cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
 )%>%
 #---has pattern "^[1-9][oO]", that is, o replaces 0
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]"), time_delayed*10, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]") , "minutes", units_delayed)
 ) %>%
 #---final outliers and rejects
 #just switch to minutes
 mutate(
 units_delayed = ifelse(How_Long_Delayed == "35 SM", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "30am", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "30INS", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45 late", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "40n", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45 am", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45NIN", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45 INUTES", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45IN", "minutes", units_delayed)
 ) %>%
 #needs time modifications
 mutate(
 time_delayed = ifelse(How_Long_Delayed == "!0 mins", 1, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "MAYBE 1/2", 30, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "half hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "Half hour", 30, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "Half hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "one hour", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "one hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "ONE HOUR", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "ONE HOUR", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "HOUR", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "HOUR", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "hour", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "HR1", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "HR1", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1/ 15 min", 75, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:20MIN", 80, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1 20 min", 80, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:30 min", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:30?mins", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1 30mnts", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1 30 min", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:40 MINS", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "IHR40MIN", 100, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "IHR40MIN", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "3 HRS", 180, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "3 HRS", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "3 hr", 180, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "3 hr", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "3 HOUR", 180, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "3 HOUR", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "4 hours", 240, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "4 hours", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "4 hrs", 240, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "4 hrs", "minutes", units_delayed)
 ) %>%
 #sometimes the record just reads 'mins' or equivalent. My 'units_delayed' picks this up as 'minutes'. We take a guess and set this to 5 minutes.
 mutate(
 time_delayed = ifelse(units_delayed == "minutes" & str_detect(How_Long_Delayed, "[mM][iI][nN]") & is.na(time_delayed), 5, time_delayed)
 ) %>%
 filter(!(is.na(time_delayed))) %>%
 filter(time_delayed <= 300) %>% #Any delays beyond 5 hours are probably typos, but also problematic if real.
 filter(units_delayed == "minutes") %>% #222 820 / 259 637. The rest are garbage or NA.
 select(
 Busbreakdown_ID, School_Year, Route_Number, Schools_Serviced, Bus_Company_Name,
 Has_Contractor_Notified_Schools, Has_Contractor_Notified_Parents, Have_You_Alerted_OPT, Reason,
 Boro, Number_Of_Students_On_The_Bus, School_Age_or_PreK,
 Created_On, Occurred_On,
 time_delayed
 )

#---plot of the time_delayed
library(RColorBrewer)

pal <- brewer.pal(12, "Paired")

reason_levels <- ii_times %>%
 group_by(Reason) %>%
 mutate(
 total_of_reason = n()
 ) %>% distinct(Reason, total_of_reason) %>%
 arrange(total_of_reason) %>%
 ungroup() %>%
 mutate(total_of_reason = row_number())

ii_times %>%
 mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
 mutate(time_delayed = time_delayed) %>%
 ggplot(aes(x = time_delayed, fill = Reason)) +
 geom_bar() +
 theme_classic() +
 scale_fill_manual(values = pal) +
 coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
 labs(x="time_delayed [min]",y="Observations") +
 theme(text=element_text(family="Arial", size=16)) +
 theme(axis.line = element_line(colour = 'black', size = 1)) +
 theme(axis.ticks = element_line(colour = "black", size = 1)) +
 ggtitle("Observations by Reason")

A bar graph broken up Reason shows that, as observed when cleaning up the delay data, these are ‘human approximated delays’. The driver or attendant (or perhaps someone at operations) has approximated the delay to the nearest convenient time package: 5 minutes, 10 minutes, 15 minutes, 20, 25, 30, 40, 45, 50, 60, 90 minutes, and so on. Some delays have been figured out more accurately by comparing time differences, but most have not. One of the consequences is that if we decide to run a regression model for this data, we can never expect it to be more precise than 5-10 minutes of RMS error. Because that is how the data have been recorded.

Note that there are a few data with delays beyond 100 minutes, and these points will show up later. The graph is presented this way for clarity. Traffic constitutes the major source of delay in the data set; accident is the most rare problem. ‘Other’, meaning a missing reason or undisclosed reason, is the second most common value in this field.

We can have a look at some of the structure via density plots:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
ii_times %>%
 mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
 mutate(time_delayed = time_delayed) %>%
 ggplot(aes(x = time_delayed, fill = Reason)) +
 geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
 theme_classic() +
 scale_fill_manual(values = pal) +
 coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
 labs(x="time_delayed [min]",y="Observation count") +
 theme(text=element_text(family="Arial", size=16)) +
 theme(axis.line = element_line(colour = 'black', size = 1)) +
 theme(axis.ticks = element_line(colour = "black", size = 1)) +
 ggtitle("Density: observations by Reason")

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
boro_levels <- ii_times %>%
 mutate(
 Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
 ) %>%
 group_by(Boro) %>%
 mutate(
 total_of_boro = n()
 ) %>% distinct(Boro, total_of_boro) %>%
 arrange(total_of_boro) %>%
 ungroup() %>%
 mutate(total_of_boro = row_number())

ii_times %>%
 mutate(
 Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
 ) %>%
 mutate(Boro = factor(Boro, levels = boro_levels[[1]])) %>%
 ggplot(aes(x = time_delayed, fill = Boro)) +
 geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
 theme_classic() +
 scale_fill_manual(values = pal) +
 coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
 labs(x="time_delayed [min]",y="Observation count") +
 theme(text=element_text(family="Arial", size=16)) +
 theme(axis.line = element_line(colour = 'black', size = 1)) +
 theme(axis.ticks = element_line(colour = "black", size = 1)) +
 ggtitle("Density: observations by Borough")

Here we must be careful with interpretation, since the density plot presents counts convolved with a gaussian density kernel. This helps us to see structure, but we must be careful interpreting the counts. There are still very few observations with delay time of 22 minutes, for example.

But we can see some structure a little more clearly. For example, Manhattan produces more long-time delays than the Bronx or Staten Island.

In the next set of notes we will try to fit a regression, based only on this data. However, we can already see the need to join more data on as we have few predictors. Before building a model or joining to the other data sets, there is a little more cleaning and dummifying of categorical variables to do:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
ii_spread <- ii_times %>%
 mutate(
 time_diff_report = abs(as.numeric(difftime(Created_On, Occurred_On, units = "mins"))) 
 ) %>%
 mutate(
 reported_before_resolved = (time_delayed > time_diff_report)*1L,
 Created_On = NULL,
 Has_Contractor_Notified_Schools = (Has_Contractor_Notified_Schools == "Yes")*1L,
 Has_Contractor_Notified_Parents = (Has_Contractor_Notified_Parents == "Yes")*1L,
 Have_You_Alerted_OPT = (Have_You_Alerted_OPT == "Yes")*1L,
 School_Age = (School_Age_or_PreK == "School-Age")*1L,
 School_Age_or_PreK = NULL
 ) %>%
 mutate(dummy = 1L,
 Reason = str_replace_all(Reason, " ", ""),
 Reason = str_replace_all(Reason, "
  1. Advanced Modeling

Tags

  • Data Manipulation
  • Data Visualisation
  • ggplot2
  • R Programming

The motivation for choosing this data set to explore is straightforward: I wanted a real-world data set for which additional meta data would be easy to find, making NYC data perfect; I wanted something which did not require specialised domain knowledge, and so buses are a good choice; and I wanted a data set that might produce some interesting insights. As an additional bonus, this data set had very few downloads or kernels on Kaggle, so it seemed like mostly untrodden ground.

Data are available from the NYC open data site. Linked is the main data page, which contains the bus breakdown ID number, as well as the route number, the schools serviced, the borough through which the bus travels, and so on.

The readme file included in this bundle also contains references to linking information. Specifically, the drivers and attendants, routes, transportation sites, vehicles, pre-k riders by transportation site, routes by transportation site, and pre-k vendors by transportation site. All of this data can be combined (see later) to provide route- and vendor- level data to the main delay data set. For example, to quanitfy the total number of drivers employed by each company, the total number of students or schools that they service, and so on.

Our goal is to be able to make some kind of useful prediction about how long a bus is going to be delayed, at the time its delay is called into the operations centre. So we take a look at our primary ‘ii_’ data set.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
in_csv <- "../data/II_Bus_Breakdown_and_Delays.csv"
ii_breakdowns <- read_csv(in_csv)
ii_breakdowns %>% glimpse
## Observations: 259,637
## Variables: 21
## $ School_Year "2015-2016", "2015-2016", "201...
## $ Busbreakdown_ID 1212699, 1212700, 1212701, 121...
## $ Run_Type "Special Ed AM Run", "Special ...
## $ Bus_No "48186", "2518", "235", "2102"...
## $ Route_Number "N758", "L530", "K168", "K216"...
## $ Reason "Other", "Mechanical Problem",...
## $ Schools_Serviced "75485", "21854", "18366", "21...
## $ Occurred_On 2015-09-02 06:27:00, 2015-09-...
## $ Created_On 2015-09-02 06:29:00, 2015-09-...
## $ Boro "Nassau County", "Brooklyn", "...
## $ Bus_Company_Name "BORO TRANSIT, INC.", "RELIANT...
## $ How_Long_Delayed "25 minutes", NA, "30MINS", "2...
## $ Number_Of_Students_On_The_Bus 0, 0, 0, 1, 0, 0, 0, 9, 0, 2, ...
## $ Has_Contractor_Notified_Schools "Yes", "Yes", "Yes", "Yes", "Y...
## $ Has_Contractor_Notified_Parents "No", "Yes", "Yes", "Yes", "Ye...
## $ Have_You_Alerted_OPT "No", "Yes", "No", "No", "No",...
## $ Informed_On 2015-09-02 06:29:00, 2015-09-...
## $ Incident_Number NA, NA, NA, NA, NA, NA, NA, NA...
## $ Last_Updated_On 2015-09-02 06:29:16, 2015-09-...
## $ Breakdown_or_Running_Late "Running Late", "Breakdown", "...
## $ School_Age_or_PreK "School-Age", "School-Age", "S...

`ii_breakdowns %>% distinct(How_Long_Delayed) %>% head(n=20)`
## # A tibble: 20 x 1
## How_Long_Delayed
## 
## 1 25 minutes 
## 2 
## 3 30MINS 
## 4 20 min 
## 5 30 min 
## 6 90 MIN 
## 7 20 MINS 
## 8 20 
## 9 30 mins 
## 10 15 MINS 
## 11 20 MIN 
## 12 15 mins 
## 13 30 MINS 
## 14 25 
## 15 0 
## 16 10 mins 
## 17 25 min 
## 18 1hr 
## 19 20 mins 
## 20 45 mins

There is an obvious problem: these delays are 'human recorded', meaning that the format is inconsistent. Sometimes we have minutes, min, mts, hours, hrs; sometimes just two numbers separated by ':', and so on. Additionally, the times are not accurately recorded, as we will verify in a moment.
Wrangling the delay times into something machine readable takes some doing. I don't doubt that there is a more general path available, but I went through sequentially, case by case, as generally as I could; then cleaned up the rest at the end.
#--sequential clean
ii_times <- ii_breakdowns %>%
 mutate(units_delayed = str_extract(How_Long_Delayed, "[:alpha:]+"),
 time_delayed = as.numeric(str_extract(How_Long_Delayed, "[:digit:]+")),
 units_delayed = ifelse(str_detect(units_delayed, "^[mM]"), "minutes", "other"),
 units_delayed = ifelse(is.na(units_delayed), "other", units_delayed)
 ) %>%
 #remove "in" case, exclude any flagged with "[hH}"
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "in") & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
 ) %>%
 #deal with "to" case with minutes by taking only the first number. There are apparently no 'to' strings with hours in them.
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "to") & str_detect(How_Long_Delayed, "[mM]"),"minutes", units_delayed)
 ) %>%
 #deal with the - cases which don't involve hours at all
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
 ) %>%
 #deal with the double digit minute-1hr:
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & str_detect(How_Long_Delayed, "^[1-5][0-9]"),"minutes", units_delayed)
 ) %>%
 #everything left with a hyphen is a 1 hour or more delay
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), "minutes", units_delayed)
 ) %>%
 #-all hyphens dealt with at this point
 #Now work on the "/": first take the minute timers: all those with double digits first
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "/") & str_detect(How_Long_Delayed, "^[1-9][0-9]"), "minutes", units_delayed)
 ) %>%
 #All those starting with 1/2 go to 30 minutes
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), 30, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), "minutes", units_delayed)
 ) %>%
 #All those starting with 1 1/2 go to 90 minutes
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), "minutes", units_delayed)
 ) %>%
 #---specific cases
 #"1 HOUR1/2"
 mutate(
 time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", "minutes", units_delayed)
 ) %>%
 #"@least 1/2"
 mutate(
 time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", 30, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", "minutes", units_delayed)
 ) %>%
 #"1HR/20MIN"
 mutate(
 time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", 80, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", "minutes", units_delayed)
 )%>%
 #--end specific cases
 #1hr/x case
 #This is modifying things it shouldn't. Why is this even here?
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), "minutes", units_delayed)
 )%>%
 #---cases with ":"
 #remove cases which erroneously report a time instead of a duration, since these are not reliable
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, ":") & str_detect(How_Long_Delayed, "[aA][mM]"), NA, time_delayed)
 )%>%
 #2:hr cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), 120, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), "minutes", units_delayed)
 )%>%
 #1:45 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), 105, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), "minutes", units_delayed)
 )%>%
 #1:30 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), "minutes", units_delayed)
 )%>%
 #1: cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), "minutes", units_delayed)
 )%>%
 #--end cases with ":"
 #--cases with "^2"
 #2 hr cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), 120, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
 )%>%
 #double digit minute cases:
 mutate(
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2[0-9]"), "minutes", units_delayed) #times are already correct
 )%>%
 #---"^1"
 #1 hour cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), "minutes", units_delayed)
 )%>%
 #1.5 hour cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), "minutes", units_delayed)
 )%>%
 #NR: some of the ^1 x min cases going in here are already tagged as minutes.
 #1 hr 30 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), "minutes", units_delayed)
 )%>%
 #1 hr 15 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), 75, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), "minutes", units_delayed)
 )%>%
 #1 hr 20 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), 80, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), "minutes", units_delayed)
 )%>%
 #1 hr 45 cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), 105, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), "minutes", units_delayed)
 )%>%
 #1 hr half
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), 90, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), "minutes", units_delayed)
 )%>%
 #1 h cases
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), 60, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
 )%>%
 #---has pattern "^[1-9][oO]", that is, o replaces 0
 mutate(
 time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]"), time_delayed*10, time_delayed),
 units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]") , "minutes", units_delayed)
 ) %>%
 #---final outliers and rejects
 #just switch to minutes
 mutate(
 units_delayed = ifelse(How_Long_Delayed == "35 SM", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "30am", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "30INS", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45 late", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "40n", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45 am", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45NIN", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45 INUTES", "minutes", units_delayed),
 units_delayed = ifelse(How_Long_Delayed == "45IN", "minutes", units_delayed)
 ) %>%
 #needs time modifications
 mutate(
 time_delayed = ifelse(How_Long_Delayed == "!0 mins", 1, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "MAYBE 1/2", 30, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "half hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "Half hour", 30, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "Half hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "one hour", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "one hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "ONE HOUR", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "ONE HOUR", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "HOUR", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "HOUR", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "hour", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "hour", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "HR1", 60, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "HR1", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1/ 15 min", 75, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:20MIN", 80, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1 20 min", 80, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:30 min", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:30?mins", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1 30mnts", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1 30 min", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "1:40 MINS", 90, time_delayed),
 time_delayed = ifelse(How_Long_Delayed == "IHR40MIN", 100, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "IHR40MIN", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "3 HRS", 180, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "3 HRS", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "3 hr", 180, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "3 hr", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "3 HOUR", 180, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "3 HOUR", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "4 hours", 240, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "4 hours", "minutes", units_delayed),
 time_delayed = ifelse(How_Long_Delayed == "4 hrs", 240, time_delayed),
 units_delayed = ifelse(How_Long_Delayed == "4 hrs", "minutes", units_delayed)
 ) %>%
 #sometimes the record just reads 'mins' or equivalent. My 'units_delayed' picks this up as 'minutes'. We take a guess and set this to 5 minutes.
 mutate(
 time_delayed = ifelse(units_delayed == "minutes" & str_detect(How_Long_Delayed, "[mM][iI][nN]") & is.na(time_delayed), 5, time_delayed)
 ) %>%
 filter(!(is.na(time_delayed))) %>%
 filter(time_delayed <= 300) %>% #Any delays beyond 5 hours are probably typos, but also problematic if real.
 filter(units_delayed == "minutes") %>% #222 820 / 259 637. The rest are garbage or NA.
 select(
 Busbreakdown_ID, School_Year, Route_Number, Schools_Serviced, Bus_Company_Name,
 Has_Contractor_Notified_Schools, Has_Contractor_Notified_Parents, Have_You_Alerted_OPT, Reason,
 Boro, Number_Of_Students_On_The_Bus, School_Age_or_PreK,
 Created_On, Occurred_On,
 time_delayed
 )

#---plot of the time_delayed
library(RColorBrewer)

pal <- brewer.pal(12, "Paired")

reason_levels <- ii_times %>%
 group_by(Reason) %>%
 mutate(
 total_of_reason = n()
 ) %>% distinct(Reason, total_of_reason) %>%
 arrange(total_of_reason) %>%
 ungroup() %>%
 mutate(total_of_reason = row_number())

ii_times %>%
 mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
 mutate(time_delayed = time_delayed) %>%
 ggplot(aes(x = time_delayed, fill = Reason)) +
 geom_bar() +
 theme_classic() +
 scale_fill_manual(values = pal) +
 coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
 labs(x="time_delayed [min]",y="Observations") +
 theme(text=element_text(family="Arial", size=16)) +
 theme(axis.line = element_line(colour = 'black', size = 1)) +
 theme(axis.ticks = element_line(colour = "black", size = 1)) +
 ggtitle("Observations by Reason")

A bar graph broken up Reason shows that, as observed when cleaning up the delay data, these are ‘human approximated delays’. The driver or attendant (or perhaps someone at operations) has approximated the delay to the nearest convenient time package: 5 minutes, 10 minutes, 15 minutes, 20, 25, 30, 40, 45, 50, 60, 90 minutes, and so on. Some delays have been figured out more accurately by comparing time differences, but most have not. One of the consequences is that if we decide to run a regression model for this data, we can never expect it to be more precise than 5-10 minutes of RMS error. Because that is how the data have been recorded.

Note that there are a few data with delays beyond 100 minutes, and these points will show up later. The graph is presented this way for clarity. Traffic constitutes the major source of delay in the data set; accident is the most rare problem. ‘Other’, meaning a missing reason or undisclosed reason, is the second most common value in this field.

We can have a look at some of the structure via density plots:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
ii_times %>%
 mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
 mutate(time_delayed = time_delayed) %>%
 ggplot(aes(x = time_delayed, fill = Reason)) +
 geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
 theme_classic() +
 scale_fill_manual(values = pal) +
 coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
 labs(x="time_delayed [min]",y="Observation count") +
 theme(text=element_text(family="Arial", size=16)) +
 theme(axis.line = element_line(colour = 'black', size = 1)) +
 theme(axis.ticks = element_line(colour = "black", size = 1)) +
 ggtitle("Density: observations by Reason")

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
boro_levels <- ii_times %>%
 mutate(
 Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
 ) %>%
 group_by(Boro) %>%
 mutate(
 total_of_boro = n()
 ) %>% distinct(Boro, total_of_boro) %>%
 arrange(total_of_boro) %>%
 ungroup() %>%
 mutate(total_of_boro = row_number())

ii_times %>%
 mutate(
 Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
 ) %>%
 mutate(Boro = factor(Boro, levels = boro_levels[[1]])) %>%
 ggplot(aes(x = time_delayed, fill = Boro)) +
 geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
 theme_classic() +
 scale_fill_manual(values = pal) +
 coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
 labs(x="time_delayed [min]",y="Observation count") +
 theme(text=element_text(family="Arial", size=16)) +
 theme(axis.line = element_line(colour = 'black', size = 1)) +
 theme(axis.ticks = element_line(colour = "black", size = 1)) +
 ggtitle("Density: observations by Borough")

Here we must be careful with interpretation, since the density plot presents counts convolved with a gaussian density kernel. This helps us to see structure, but we must be careful interpreting the counts. There are still very few observations with delay time of 22 minutes, for example.

But we can see some structure a little more clearly. For example, Manhattan produces more long-time delays than the Bronx or Staten Island.

In the next set of notes we will try to fit a regression, based only on this data. However, we can already see the need to join more data on as we have few predictors. Before building a model or joining to the other data sets, there is a little more cleaning and dummifying of categorical variables to do:

", "") ) %>% spread(key = Reason, value = dummy, fill = 0L, sep = "_") %>% mutate(Reason_Other = NULL) %>% #default case is Other mutate(dummy = 1L, Boro = ifelse(is.na(Boro), "All Boroughs",Boro), Boro = str_replace_all(Boro, " ", "") ) %>% spread(key = Boro, value = dummy, fill = 0L, sep = "_") %>% mutate(Boro_AllBoroughs = NULL) %>% #default case is all mutate( Number_Of_Students_On_The_Bus = as.numeric(Number_Of_Students_On_The_Bus) #This field has some problems we will deal with later ) %>% filter(Number_Of_Students_On_The_Bus <= 80) #Although this cut is arbitrary, we choose not to believe the records with more than 80 people on the bus.

Related Post

  • Explaining Black-Box Machine Learning Models – Code Part 2: Text classification with LIME
  • Image clustering with Keras and k-Means
  • ‘How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm.
  • Machine learning: Logistic regression and decision trees for healthcare in R
  • Machine Learning Benchmarking with SFA in R

Related