We are picking up from our previously loaded NYC bus delay data and from our simple Cubist regression fit. We know that we need to add more predictors, and we have access to company-level data. A warning to the faint-hearted: this post is essentially all data manipulation and joins. No pretty graphs, no conclusions. I am documenting the slog because the slog is required. We will move on to better things afterwards.
Preparation of company and route level data
We read in all of our data and take a look:
In my actual workflow, I wrote out all of my intermediate outputs as files that I could keep, and joined them in a separate workbook. In lieu of that process, the code below the fold simply renames some of the data sets that we have developed above. Apologies for any confusion: I promise that my project organization is a little less chaotic than this reporting markdown presentation.
Joining the data
After a lot of messing around, we are at the stage where we can join the company-level staff and vehicle data onto the main data set. We also use the time of each delay to build variables that measure how close it falls to the morning or evening rush hour.
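To make the shape of this step concrete, here is a minimal sketch in base R. It is not the code from my workbook: the toy data, the column names (`company`, `occurred_on`, `staff`, `vehicles`) and the rush-hour peaks of 08:00 and 17:00 are all assumptions chosen for illustration.

```r
# Toy stand-ins for the real tables; every name and value here is hypothetical.
delays <- data.frame(
  row_id = 1:4,
  company = c("A", "B", "A", "C"),
  occurred_on = as.POSIXct(
    c("2018-03-01 07:45:00", "2018-03-01 12:30:00",
      "2018-03-01 16:50:00", "2018-03-01 22:10:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

company_info <- data.frame(
  company = c("A", "B"),
  staff = c(120, 45),
  vehicles = c(80, 30),
  stringsAsFactors = FALSE
)

# Left join: keep every delay record; companies with no match get NA.
joined <- merge(delays, company_info, by = "company", all.x = TRUE)

# merge() does not preserve the original row order, so restore it.
joined <- joined[order(joined$row_id), ]

# Minutes since midnight, kept as whole numbers to avoid floating-point noise.
minute_of_day <-
  as.numeric(format(joined$occurred_on, "%H", tz = "UTC")) * 60 +
  as.numeric(format(joined$occurred_on, "%M", tz = "UTC"))

# Assumed rush-hour peaks: 08:00 and 17:00.
am_peak <- 8 * 60
pm_peak <- 17 * 60

joined$mins_to_am_peak <- abs(minute_of_day - am_peak)
joined$mins_to_pm_peak <- abs(minute_of_day - pm_peak)
joined$mins_to_nearest_peak <- pmin(joined$mins_to_am_peak, joined$mins_to_pm_peak)

print(joined)
```

The left join (`all.x = TRUE`) is deliberate: an inner join would silently drop delay records for any company missing from the company table, whereas here those rows survive with `NA` in the staff and vehicle columns and can be handled explicitly later.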