I recently analyzed a somewhat puzzling data set. I was sending HTTP POST requests to a system. The system would then acknowledge receipt of these requests (returning a 200 status code), and some time later (it was a slow asynchronous process) send a web hook to a specified URL if the request was successful. However, success was far from certain; most requests actually failed. My job was to measure the success rate.
Concretely, event A would trigger at some time t0. If A was successful, then event B might occur at time t1. B can only occur if A occurred.
Systems like this come up in a variety of contexts:
- Ad delivery. The ad must first be displayed (event A), and only after it's displayed can the viewer click a link (event B).
- Email. The email must first be opened (event A), and only after it's opened can the reader click a link (event B).
- Web forms. A user must first enter their credit card, and only after that can they click submit.
What I wanted to compute was alpha = P(A) and beta = P(B | A).
When analyzing the data I had, I noticed a curious pattern.
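A few rows of the kind of data I was looking at are shown below (the request IDs other than jkl are placeholders):

| Request ID | A reported | B reported |
|------------|------------|------------|
| abc        | yes        | no         |
| def        | yes        | yes        |
| ghi        | no         | no         |
| jkl        | no         | yes        |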
That last row (for request ID jkl) indicates something really weird happening. It suggests that event B has occurred even though event A has not!
According to my model, which I have a high degree of confidence in, this isn't possible. Yet it's in the data; the responding system could not post a web hook with ID jkl if it hadn't received the request, because it couldn't possibly know this ID.
The conclusion I drew is that our measurements of A and B are unreliable. A and B may actually occur without being observed. So the real problem at hand is to infer the true rates at which A and B occur from the complete data set.
Some simplified calculations
I'll begin with some simple calculations - using nothing but arithmetic - to give the flavor of this analysis. To make things concrete, suppose we have the following counts:
- 100k requests were made
- In 40k cases, event A was reported and B was not reported
- In 10k cases, event A was reported and then B was reported
- In 5k cases, event B was reported but A was never reported
The most naive possible approach is to simply treat the cases where B was reported without A as bad data and discard them. Then we can estimate:
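alpha ~= 50k / 95k ~= 52.6% and beta ~= 10k / 50k = 20%, for an overall success rate of alpha * beta ~= 10.5%.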
But we can do better than this. We can use logical inference to deduce that in every case where B was reported, A must have occurred. So we actually know that A occurred 55k times, and A followed by B occurred 15k times. So we can then estimate:
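alpha ~= 55k / 100k = 55% and beta ~= 15k / 55k ~= 27.3%.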
Finally, there's a third approach we can take. Let's define the parameters gamma_A = P(A reported | A occurred) and gamma_B = P(B reported | B occurred). Let's assume that gamma_A = gamma_B = gamma; this is reasonable in the event that events A and B are measured by the same mechanism (e.g., a tracking pixel).
Then we can infer, based on the fact that B was reported at least 5k times without A being reported, that approximately 10% of the time (5k unreported A occurrences / 50k A reports), data is lost. This suggests gamma ~= 0.9.
We can then estimate that there were 50k / 0.9 = 55.56k occurrences of A and 15k / 0.9 = 16.67k occurrences of B, yielding:
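alpha ~= 55.56k / 100k ~= 55.6% and beta ~= 16.67k / 55.56k ~= 30%, for an overall success rate of alpha * beta ~= 16.7%.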
Small errors result in big differences!
Based on the data we have, we've guesstimated that approximately 10% of the events which occur are not reported. However, this effect cascades and results in an overall success rate of alpha * beta being reported as 10.5% (= 10,000 / 95,000) rather than 16.7% (= 16,670 / 100,000). That's a huge difference!
Statistical method
These calculations are all great, but we also need to deal with uncertainty. It's possible that actually gamma = 0.95 but we simply got unlucky, or gamma = 0.85 and we got very lucky. How can we quantify this?
This can be done relatively straightforwardly with pymc3.
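Here's a minimal sketch of one way this generative story could be encoded (the uniform priors, the assumption that A- and B-reports are each lost independently with probability 1 - gamma, the test values, and the sampler settings are illustrative choices, not the only reasonable ones):

```python
import pymc3 as pm

# Reported counts from the example above
n_requests = 100000
a_reported_count = 50000       # A reported in total (40k A-only + 10k A-and-B)
b_reported_count = 15000       # B reported in total (10k with A + 5k without A)
b_only_reported_count = 5000   # B reported but A never reported

with pm.Model() as model:
    # Fundamental parameters of the problem
    alpha = pm.Uniform('alpha', 0, 1)   # P(A occurs)
    beta = pm.Uniform('beta', 0, 1)     # P(B occurs | A occurred)
    gamma = pm.Uniform('gamma', 0, 1)   # P(an occurrence is reported)

    # Latent true occurrence counts: the quantities we actually care about
    a_occurred = pm.Binomial('a_occurred', n=n_requests, p=alpha, testval=60000)
    b_occurred = pm.Binomial('b_occurred', n=a_occurred, p=beta, testval=20000)

    # The reported counts are thinned versions of the true counts
    pm.Binomial('a_reported', n=a_occurred, p=gamma, observed=a_reported_count)
    pm.Binomial('b_reported', n=b_occurred, p=gamma, observed=b_reported_count)

    # The B-reports that arrived without an A-report are the evidence that
    # reports get lost at all. Treating this count as one more (approximately
    # independent) binomial observation assumes each report is lost
    # independently with probability 1 - gamma.
    pm.Binomial('b_only_reported', n=b_occurred, p=gamma * (1 - gamma),
                observed=b_only_reported_count)

    # Discrete latent counts get a Metropolis step automatically; thin the
    # trace when plotting to reduce autocorrelation.
    trace = pm.sample(20000, tune=5000)
```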
The results can then be plotted:
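For instance, sticking with the variable names from the sketch above (pl is pylab):

```python
import pylab as pl

# Posterior over the true number of A occurrences; thin the Metropolis
# trace to reduce autocorrelation
pl.hist(trace['a_occurred'][::20], bins=50)
pl.xlabel('true number of A occurrences')

pl.figure()
pl.hist(trace['b_occurred'][::20], bins=50)
pl.xlabel('true number of B occurrences')
pl.show()
```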
As is expected, we have sharp lower bounds; the true number of events could not be lower than our observed number of events.
These numbers are in rough accordance with our heuristic calculations above.
Extracting fundamental parameters
In the above analysis, we've done two important things.
First, we've built a nowcast of our underlying data. That is to say, while the number of times events A and B are reported is directly observable, the actual number of times they occur is not. So we can construct better estimates (as well as credible intervals) of the true event occurrence counts.
Second, we've built a direct probabilistic way of computing the fundamental parameters of the problem, namely alpha and beta. In our pymc code, just as we can plot a histogram of a_occurred (via pl.hist(trace['a_occurred'][::20], bins=50)), we can similarly plot a histogram of alpha itself. In many instances - e.g., A/B testing or bandit algorithms - the underlying probabilities are the parameters of direct interest. The actual counts are only incidental.
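With the same trace from the sketch above, that looks like:

```python
# Posterior over the fundamental parameter alpha = P(A) itself
pl.hist(trace['alpha'][::20], bins=50)
pl.xlabel('alpha = P(A)')
pl.show()
```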
The conclusion here is that missing data is not a fundamentally limiting factor in running many analyses. Provided you have a reasonably complete generative model of the data collection process, and adequate data to fit the model, you can actually correct for missing data when running such analyses.