In a paper at NIPS ’15, Judea Pearl and co. put together a toy example when all of the standard multi-armed bandit (MAB) algorithms fail. The paper shows that a MAB algorithm can do worse than random guessing, how to overcome the problem in one such case, and raises questions that need to be addressed when standard MAB algorithms are used in practice. And as long as we’re describing human learning as Bayesian updating, the paper is also interesting as a comment on optimizing human decision making.

But looking at the google scholar citations, that paper was only cited 17 times since then (well really 16, 1 paper is in there twice). So what gives?

The paper is a fairly straightforward extension of the MAB problem. In typical MAB, the goal is to specify a decision making algorithm for selecting a discrete choice (i.e. an arm of a casino slot machine) that minimizes regret over the trials played (i.e. the opportunity cost associated with selecting a suboptimal option). The paper calls out two issues in this setting: (i) when observations used to set the parameters of the algorithm are censored to exclude the high performing contexts, the algorithm can perform worse than random guessing, and (ii) when the context isn’t accounted for, even with absolutely random exploration the algorithm won’t end up selecting the optimal choice. Their simple solution is to augment standard Thompson sampling, which usually makes a selection by taking the arm with the highest reward sampled from a distribution modeling belief over arm rewards.  Here, the algorithm instead first samples arm values, then uses the difference between expected rewards from the two arms to scale down the sampled value of the highest arm, and finally plays the arm with the higher value after the adjustment.  Using the difference in expected rewards allows the algorithm to avoid conditioning on context directly, and just compares the two choices given the would-be selection.  In cases where the higher sampled value comes from the lower performing arm, the sampled value will eventually get scaled down to 0 and the arm will be selected less frequently.  Among the unaddressed points are: (i) how to compute this difference when there are more than 2 arms and the relationship between them is unknown, (ii) why is this the right way to include belief about counterfactual performance in the algorithm (are there more direct ways of including it?), and (iii) whether this can be done without Thompson sampling, which can get computationally expensive as the number of arms and contexts gets to be large.

The authors also argue that unobserved confounding may be more the rule than the exception in practice, citing the familiar example of doctors prescribing more expensive drugs to more affluent patients, which they use to draw a distinction between random sampling of prescription events and random assignment of drugs to patients. In that case, the effect of the drug can’t be decoupled from the effect of difference in lifestyle without also prescribing the same drug to the less privileged patients.

As an even more familiar example for those doing data science at tech companies, consider online ad auctions, where censoring of observations can happen through impression win rates. The table below reproduces values in Table 1 from the paper, with win probabilities taking the place of intuition. Let’s say what we want to do is set bid values for two ads in a campaign by adjusting the budget they need to spend in some defined amount of time. If this is done without considering the combinations of segments and sites when selecting a budget level, both ads could have performance ~(0.150.8 + 0.450.1), which would end up being significantly worse than bidding correctly on the low win rate / high conversion rate combinations of user segments and sites.

So given all this – a relatively simple paper, ample room for improvement, it’s broad relevance and the costly implications of ignoring ideas present in the same literature for over 20 years, why has so little work picked up on this thread?  Are tech companies successfully avoiding these issues?  And, how are we at the point where Pearl appears in a recent Atlantic article as one of machine learning’s “sharpest critics”?