Multi-Armed Bandit


Imagine you are standing in front of a row of slot machines and wish to gamble. You have a bag full of coins. Your goal is to maximize the return on your investment. The problem is that you don’t know the payout percentages of any of the machines; each machine has a potentially different expected return.

What is your strategy?

Ideas

You could select one machine at random, and invest all your coins there, but what happens if you selected a poor payout machine? You could have done better.

You could spread your money out and divide it equally (or randomly) between all the different machines. However, if you did this, you’d spend some time investing in poorer payout machines, ‘wasting’ coins that could have been inserted into better machines. The benefit of this strategy, however, is diversification: you’d be spreading your risk over many machines; you’re never going to be playing the best machine all the time, but you’re never going to be playing the worst all the time either!

Maybe a hybrid strategy is better? In a hybrid solution you could initially spend some time experimenting to estimate the payouts of the machines then, in an exploitation phase, you could put all your future investment into the best paying machine you’d discovered. The more you research, the more you learn about the machines (getting feedback on their individual payout percentages).

However, what is the optimal hybrid strategy? You could spend a long time researching the machines (increasing your confidence), and certainly, the longer you spend, the more accurate your prediction of the best machine would become. However, if you spend too long on research, you might not have many coins left to properly leverage this knowledge (and you’d have wasted many coins on machines that are poor payers). Conversely, if you spend too short a time on research, your estimate of which machine is best could be bogus (and if you are unlucky, you could fall victim to a streak of ‘good luck’ from a poor-paying machine that tricks you into thinking it’s the best one).

If you are playing a machine that is “good enough”, is it worth the risk of trying to find out whether another machine is “better”? (The experiments needed to determine this might not be worth the effort.)

Hybrid

Maybe we could devise an adaptive strategy? After all, after each pull we gain information about the system. Also, what happens if, heaven forbid, the machines slowly change their payout percentages as we interact with them?

Multi-Armed Bandit Problems

These kinds of problems are known, affectionately, by data scientists as “multi-armed bandit problems” (in reference to slot machines, which are also known as one-armed bandits).

The multi-armed bandit problem has many, many applications:

  • Maybe you’re running an online advertising campaign and wish to optimize response rates and have three different creatives you wish to try. You could split the traffic, equally, three ways, but if you did that, then two thirds of the recipients would be viewing less responsive adverts for the entire campaign. If one of the adverts is a ‘dog’, wouldn’t you like to know that quickly, and adapt on the fly? Wouldn’t you want the most responsive adverts to start to be dynamically served to a higher percentage of people?

  • Similarly, if running a drug trial with a plurality of possible options, and early data shows that one of the medications is performing poorly (or conversely, one drug is far outshining the others), wouldn’t you want to know about this early in the trial? If a drug were performing very poorly, ethically, wouldn’t you want a way to pull it from rotation early on in the trial?

  • In distributed computing, if you are farming jobs out to multiple devices, and some devices are performing slower than others, part of your load balancing might wish to adapt based on past performance of tasks sent to that box.

  • Maybe you are trying to experiment to find the best price to sell something.

  • Maybe you are trying to optimize the headline for an online article.

  • Maybe you are subcontracting out work, and want to ensure that the contractor who does a better job is getting a larger share of your business.

  • Maybe you want to find a good mix for your portfolio and investment planning …

There are many, heavily studied, algorithms for addressing the multi-armed bandit problem. Here is a high-level look at a few of them:

Epsilon-first strategy

In this strategy, the contest is broken into two phases: an exploration phase, followed by an exploitation phase. A value of epsilon ε is selected (e.g. ε=0.1), and for the first ε fraction of pulls (10% of the time in this example), a machine is selected entirely at random (with uniform probability). After these trials are done, for the remaining (1−ε) fraction of trials, the single machine that had the highest observed return during the exploration phase is used exclusively.

As you can see, this strategy uses a fixed percentage of the expected pulls to determine the best machine, then plays that machine for all the remaining pulls.
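
As a concrete illustration, here is a minimal sketch of the epsilon-first strategy in Python. The payout probabilities, the number of pulls, and the value of ε are all made-up numbers chosen purely for illustration.

```python
import random

PAYOUTS = [0.20, 0.35, 0.50]   # hypothetical (hidden) win probability of each machine
N_PULLS = 1000                 # total number of coins in the bag
EPSILON = 0.1                  # fraction of pulls spent exploring

def pull(machine):
    """Insert a coin into the machine; win 1 coin with its hidden probability, else 0."""
    return 1 if random.random() < PAYOUTS[machine] else 0

wins = [0] * len(PAYOUTS)
plays = [0] * len(PAYOUTS)

# Exploration phase: for the first epsilon fraction of pulls,
# pick a machine uniformly at random and record the result.
explore_pulls = int(EPSILON * N_PULLS)
for _ in range(explore_pulls):
    m = random.randrange(len(PAYOUTS))
    wins[m] += pull(m)
    plays[m] += 1

# Exploitation phase: commit to the machine with the best observed average payout.
best = max(range(len(PAYOUTS)),
           key=lambda m: wins[m] / plays[m] if plays[m] else 0.0)
total = sum(wins)
for _ in range(N_PULLS - explore_pulls):
    total += pull(best)

print(f"Chose machine {best}, total winnings: {total}")
```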

Epsilon-Greedy Strategy

A modification of this is an adaptive approach. Instead of using a dedicated exploration phase, exploration is a continuous process. A value of epsilon ε is selected (e.g. ε=0.1); on each pull, with probability ε, a random machine is selected. For the remaining (1−ε) of trials, the machine that is currently leading (the one with the highest observed average return so far) is used.

In the above two strategies, if ε=0.0 then the algorithms are entirely greedy.
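
Here is a minimal sketch of the epsilon-greedy strategy along the same lines; again, the payout probabilities and parameter values are made-up numbers for illustration.

```python
import random

def epsilon_greedy(n_pulls=1000, epsilon=0.1, payouts=(0.20, 0.35, 0.50)):
    wins = [0] * len(payouts)
    plays = [0] * len(payouts)
    total = 0
    for _ in range(n_pulls):
        if random.random() < epsilon:
            # Exploration pull: any machine, chosen uniformly at random.
            m = random.randrange(len(payouts))
        else:
            # Exploitation pull: the machine with the best observed average so far.
            # Unplayed machines score infinity, so each machine is tried at least once.
            m = max(range(len(payouts)),
                    key=lambda i: wins[i] / plays[i] if plays[i] else float("inf"))
        reward = 1 if random.random() < payouts[m] else 0
        wins[m] += reward
        plays[m] += 1
        total += reward
    return total

print(epsilon_greedy())
```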

Epsilon-decreasing Strategy

A further hybrid of this is the epsilon-decreasing approach. Similar to simulated annealing, the value of ε decreases over the course of the experiment. Early on in the process, ε starts with a high value, so there is a good chance that a random exploration pull will be made. Over time, the value of ε decreases (typically as an exponential decay), so that a higher percentage of the later pulls are exploitation pulls rather than exploration pulls.
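
Below is a minimal sketch of the epsilon-decreasing approach, assuming an exponential decay of ε; the starting value and decay rate are arbitrary choices for illustration.

```python
import math
import random

def epsilon_decreasing(n_pulls=1000, eps_start=1.0, decay=0.005,
                       payouts=(0.20, 0.35, 0.50)):
    wins = [0] * len(payouts)
    plays = [0] * len(payouts)
    total = 0
    for t in range(n_pulls):
        # Exponentially decay epsilon, so later pulls are mostly exploitation.
        eps = eps_start * math.exp(-decay * t)
        if random.random() < eps:
            m = random.randrange(len(payouts))          # exploration pull
        else:
            m = max(range(len(payouts)),                # exploitation pull
                    key=lambda i: wins[i] / plays[i] if plays[i] else float("inf"))
        reward = 1 if random.random() < payouts[m] else 0
        wins[m] += reward
        plays[m] += 1
        total += reward
    return total

print(epsilon_decreasing())
```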

There are more complex variants of the above based on using softmax weightings in the exploration phase.
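
For instance, one such variant replaces the uniform random choice with a softmax (Boltzmann) weighting, so that machines with higher observed average payouts are proportionally more likely to be selected, while every machine keeps a non-zero chance. Here is a minimal sketch of just the selection step; the temperature parameter is an arbitrary choice for illustration (lower values make the choice greedier).

```python
import math
import random

def softmax_choice(estimates, temperature=0.1):
    """Pick a machine with probability proportional to exp(estimate / temperature)."""
    weights = [math.exp(v / temperature) for v in estimates]
    return random.choices(range(len(estimates)), weights=weights)[0]

# Example: three machines with observed average payouts of 0.2, 0.3 and 0.5.
print(softmax_choice([0.2, 0.3, 0.5]))
```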

Bayesian Bandits

Then there is a category of algorithms that make fuller use of the information from past pulls. Rather than deterministically selecting the machine to pull, the handle of the next machine to pull can be chosen probabilistically, based on past performance. Yes, the “rich would get richer”, but there is still a chance that a less optimal machine might be selected (albeit with a lower probability), and if this turned out to be a good result, it would get reinforcement and become more likely to be pulled in future rounds, gradually creeping up.
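
One common realization of this idea (assuming each pull pays out 0 or 1 coins) is Thompson sampling: each machine’s unknown win probability is modelled with a Beta distribution, a win rate is sampled from each machine’s posterior, and the machine with the highest sample is pulled. A minimal sketch, again using made-up payout numbers for illustration:

```python
import random

def thompson_sampling(n_pulls=1000, payouts=(0.20, 0.35, 0.50)):
    # Beta(1, 1) priors: start with no opinion about any machine.
    alphas = [1] * len(payouts)   # 1 + observed wins
    betas = [1] * len(payouts)    # 1 + observed losses
    total = 0
    for _ in range(n_pulls):
        # Sample a plausible win rate for each machine from its posterior,
        # then pull the machine whose sampled rate is highest.
        samples = [random.betavariate(a, b) for a, b in zip(alphas, betas)]
        m = samples.index(max(samples))
        reward = 1 if random.random() < payouts[m] else 0
        alphas[m] += reward
        betas[m] += 1 - reward
        total += reward
    return total

print(thompson_sampling())
```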

Secretary Puzzle

Readers of my blog might see an analogy between this problem and the Sultan’s Wife problem (also known as the secretary problem). The obvious differences, however, are that in the secretary problem you don’t get the chance to sample a candidate more than once, and you don’t get the option of returning to an earlier one.

 
