Predictions from predictions.

NextBus Delay Tracker uses NextBus’s original predictions as inputs into its own future bus arrival prediction model. How well does this work in practice?

I started generating error estimates for NextBus’s predictions after missing a few buses due to my somewhat overzealous inclination to head out the door with only enough time to catch the bus exactly as it pulls up to the stop. Over time, while experimenting with different departure times to minimize my total wait, I learned to make helpful heuristic adjustments to NextBus’s predictions: for example, subtract a couple of minutes for Harvard predictions on Sunday mornings, and add at least five minutes to Newbury for the evening rush hour. Given the success of these heuristic adjustments, I wondered if I could look for patterns in NextBus’s predictions and programmatically generate more accurate predictions. That curiosity led to the creation of NextBus Delay Tracker (NBDT), a linear regression prediction model that estimates future NextBus errors using NextBus’s past prediction errors. Yet, generating more accurate predictions from less accurate predictions seems almost magical, and math provides no guarantee that this should work. How well does NBDT actually work, if at all?

Metrics and comparisons

We can evaluate NBDT’s performance in a straightforward manner. For every NextBus prediction, we want to generate a corrected NBDT prediction, and we want to know when the bus actually arrived. Using these data, we can calculate prediction errors for both algorithms. After collecting a large sample of prediction errors, we can compare the two algorithms quantitatively by calculating summary statistics of each error distribution, such as its mean and standard deviation.
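To make the bookkeeping concrete, here is a minimal sketch of how one prediction error could be computed. The `Observation` record, the `error_minutes` helper, and the sign convention (positive means the bus arrived later than predicted) are illustrative assumptions, not NBDT's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Observation:
    """One prediction paired with the arrival it was trying to predict."""
    predicted_arrival: datetime  # what the algorithm said
    actual_arrival: datetime     # when the bus really showed up


def error_minutes(obs: Observation) -> float:
    # Assumed sign convention: positive = bus arrived later than predicted.
    delta = obs.actual_arrival - obs.predicted_arrival
    return delta.total_seconds() / 60.0


# Made-up example: predicted 8:00, arrived 8:03.
late_bus = Observation(datetime(2016, 5, 14, 8, 0), datetime(2016, 5, 14, 8, 3))
print(error_minutes(late_bus))  # 3.0
```

Running the same calculation over every prediction from both algorithms would produce the two error samples that the rest of this comparison relies on.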

The mean error and standard deviation are simple statistical quantities to calculate, but also quite useful. The mean error reveals how far off predictions are on average, and in which direction, while the standard deviation measures how widely individual errors spread around that mean. In this context, values closer to zero are better: prediction algorithms with mean errors near zero and small standard deviations are more accurate, as a whole, than algorithms with larger mean errors and standard deviations.

As an example, assume that for five arrival predictions, NextBus’s errors are (-2, -1, 0, 1, 2) minutes. For the same set of predictions, NBDT’s errors are (-4, -2, 0, 2, 4) minutes. Although both sets of predictions share a mean error of zero, NBDT’s errors span a larger range. Specifically, the standard deviation of NBDT’s errors is twice that of NextBus’s. In this example, most users would have received less accurate predictions from NBDT than from NextBus.
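The arithmetic in this example can be checked with a few lines of Python. This sketch uses the population standard deviation (`statistics.pstdev`), which is my assumption about which definition applies; the two-to-one ratio holds for the sample standard deviation as well.

```python
from statistics import mean, pstdev

# The two hypothetical error sets from the example, in minutes.
nextbus_errors = [-2, -1, 0, 1, 2]
nbdt_errors = [-4, -2, 0, 2, 4]

for name, errors in [("NextBus", nextbus_errors), ("NBDT", nbdt_errors)]:
    print(f"{name}: mean = {mean(errors)}, std = {pstdev(errors):.2f}")

# Ratio of the two spreads.
ratio = pstdev(nbdt_errors) / pstdev(nextbus_errors)
print(f"ratio = {ratio:.1f}")
```

Both means are zero, while the standard deviations come out to roughly 1.41 and 2.83 minutes.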

By tracking the prediction errors for NextBus and NBDT, and keeping in mind that smaller mean errors and smaller standard deviations indicate greater accuracy, we can now compare the relative performance of each algorithm. Is it possible to generate more accurate predictions from less accurate predictions?

Single-fit prediction errors by stop

NextBus Delay Tracker identifies patterns in prediction errors and then applies these patterns to modify future NextBus predictions. There are an infinite number of ways to implement this idea, some substantially fancier than others. In the first iteration of NBDT, I implemented a linear regression to answer the following question: “Based on historical predictions, errors, and recent delays within the last few hours, given a NextBus prediction for a particular bus stop, what is the most likely actual arrival time?” Because the inputs to the linear regression for this first iteration do not contain time-of-day information to demarcate attributes such as whether a particular prediction occurred during morning rush hour or on a holiday, the correction is insensitive to the time of day. I refer to this particular regression as the “single-fit” correction: on a per-bus-stop basis, NBDT applies the same correction to all incoming NextBus predictions.
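As a rough illustration of the single-fit idea, the sketch below fits one straight line per stop that maps a NextBus prediction to an expected actual wait. It is deliberately simplified: it uses a single input (the NextBus prediction) and leaves out the recent-delay inputs mentioned above. The `fit_line` and `correct` names and all of the numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Needs at least two distinct x values.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up history for one stop: NextBus's predicted minutes until arrival,
# and the minutes that actually elapsed before the bus showed up.
predicted = [2.0, 5.0, 8.0, 12.0, 15.0]
actual = [3.0, 6.5, 10.0, 14.0, 18.5]

intercept, slope = fit_line(predicted, actual)


def correct(nextbus_minutes):
    # Single-fit: the same coefficients apply to every prediction at this stop.
    return intercept + slope * nextbus_minutes


print(round(correct(10.0), 1))
```

Because one pair of coefficients serves the whole day, the fit settles on whatever correction minimizes the overall squared error, even if that correction is wrong for particular hours.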

The following charts compare the accuracy of NBDT against NextBus between May 14th and May 28th, 2016 for the 1 Bus headed south toward Dudley. The first histogram below illustrates the distribution of prediction errors for both algorithms. For all bus stops, the single-fit NBDT regression performed better than NextBus, exhibiting mean errors closer to zero and smaller standard deviations.

However, the mean and standard deviation comparisons do not capture the full reality. Closer inspection of NextBus and NBDT’s mean errors by time-of-day reveals that NBDT’s single-fit regression does not consistently outperform NextBus. In particular, as shown in the time-series chart below of NBDT and NextBus’s mean errors, the single-fit regression performs no better than NextBus from midnight onward, and performs worse than NextBus during the early morning before 6AM. By contrast, the single-fit regression performs particularly well in the early afternoon. Given the lack of time-of-day information and the artificial constraint of using one set of coefficients to correct all predictions, these trends represent unsurprising trade-offs related to the error-minimizing nature of the linear regression.
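A by-hour breakdown like this amounts to grouping errors by the hour in which the prediction was made and averaging each group. Here is a small sketch, with a hypothetical `mean_error_by_hour` helper and made-up sample values:

```python
from collections import defaultdict
from statistics import mean


def mean_error_by_hour(samples):
    """samples: iterable of (hour_of_day, error_minutes) pairs."""
    by_hour = defaultdict(list)
    for hour, err in samples:
        by_hour[hour].append(err)
    return {hour: mean(errs) for hour, errs in sorted(by_hour.items())}


# Made-up values: large errors at 4AM, small ones at 2PM.
samples = [(4, 3.0), (4, 5.0), (14, -0.5), (14, 0.5)]
print(mean_error_by_hour(samples))  # {4: 4.0, 14: 0.0}
```

An overall mean near zero can hide hours like the 4AM group here, which is why the aggregate comparison alone does not tell the whole story.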

Hour-specific prediction errors by stop

Now that we know that the single-fit regression works well for some hours and poorly for others, can we use this time information to make an improvement? To make NextBus Delay Tracker time-aware, the second iteration introduced “multi-fit” regressions for each bus stop. Rather than limit each bus stop to a single correction applied uniformly throughout the day, the second iteration of NBDT groups predictions into time buckets (midnight to 3AM, 3AM to 6AM, 6AM to 3PM, 3PM to 8PM, 8PM to 10PM, and 10PM to midnight) and calculates a set of coefficients per bus stop and per time bucket. Predictions that come in at 5AM receive a correction derived from other 3AM to 6AM predictions, while predictions that come in at 5PM receive a correction derived from predictions between 3PM and 8PM. The time-series chart below plots NextBus’s mean prediction error against NBDT’s multi-fit prediction error. Splitting the dataset into time buckets improves NBDT’s performance for the hours in which the single-fit regression performed poorly (from midnight to 6AM) without sacrificing its accuracy during the evening rush hour.
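The bucketing step can be sketched as follows. This is a simplified, one-input version under stated assumptions: the bucket boundaries are taken from the list above with the early-morning bucket assumed to run until 6AM so that the buckets cover the whole day, and `bucket_for_hour`, `fit_per_bucket`, and `correct` are hypothetical names rather than NBDT's actual code.

```python
from collections import defaultdict

# Bucket start hours on a 24-hour clock: midnight, 3AM, 6AM, 3PM, 8PM, 10PM.
# Assumption: the second bucket ends at 6AM, leaving no gap in the day.
BUCKET_STARTS = [0, 3, 6, 15, 20, 22]


def bucket_for_hour(hour):
    """Return the start hour of the bucket that contains `hour` (0-23)."""
    return max(start for start in BUCKET_STARTS if start <= hour)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_per_bucket(history):
    """history: iterable of (hour, predicted_minutes, actual_minutes).

    Each bucket needs at least two distinct predicted values to be fit.
    """
    grouped = defaultdict(lambda: ([], []))
    for hour, predicted, actual in history:
        xs, ys = grouped[bucket_for_hour(hour)]
        xs.append(predicted)
        ys.append(actual)
    return {bucket: fit_line(xs, ys) for bucket, (xs, ys) in grouped.items()}


def correct(coeffs, hour, nextbus_minutes):
    # Look up the coefficients for the bucket this prediction falls into.
    intercept, slope = coeffs[bucket_for_hour(hour)]
    return intercept + slope * nextbus_minutes


# Made-up history: early-morning buses run 2 minutes late,
# evening rush-hour buses take twice as long as predicted.
history = [(5, 4.0, 6.0), (5, 8.0, 10.0), (17, 4.0, 8.0), (17, 8.0, 16.0)]
coeffs = fit_per_bucket(history)
print(correct(coeffs, 5, 10.0), correct(coeffs, 17, 10.0))
```

The trade-off is that each bucket sees only a fraction of the data, so sparsely served hours yield noisier coefficients than a single fit over the whole day would.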

Notably, grouping the predictions and corrections by time also improved the overall accuracy of NextBus Delay Tracker. The histogram below overlays the prediction errors for NextBus, NBDT single-fit, and NBDT multi-fit. For every stop along the 1 Bus’s route, NextBus’s predictions exhibit the largest mean error and standard deviation. While both NBDT’s single- and multi-fit regressions produce predictions with near-zero mean errors, the multi-fit regression always has the smallest standard deviation. Compared to NextBus, NBDT’s multi-fit regression predictions tend to be accurate more often; and when they are inaccurate, they are inaccurate to a lesser degree.

All this to wait less during the frigid winter?

On the question of whether the Californian in me can wait less at bus stops in the frigid Boston winter by programmatically improving NextBus predictions: yes! By applying a linear regression to past predictions and their errors, and bucketing these corrections by time, we can reduce the mean error to near zero and narrow the overall range of errors.

The idea of generating more accurate predictions from necessarily less accurate predictions should arouse at least a modicum of suspicion and skepticism in most reasonable people. NextBus Delay Tracker currently works only because NextBus’s algorithm produces predictions that reflect, implicitly, the current position of a bus and existing traffic conditions. If NextBus started generating predictions based exclusively on scheduled arrivals, and the “predictions” had no relation to real-time conditions, NextBus Delay Tracker would have much less success exploiting patterns to make improvements. The Californian in me thanks NextBus for reduced wait times at bus stops, and the engineer and scientist in me thank NextBus for providing such a fascinating problem to study.

This is the third article in a series of four. Other posts in this series:

Part 1: “This is the 1 Bus in Boston.”
Part 2: “A peek inside the black box.”
Part 4: “The Case for Public Transportation.”

Posted June 29, 2016

I am Tommy Leung, an engineer and amateur chef. These are my curiosities.