n-armed bandit problem¶
We’re going to build our way up from very simple RL algorithms to much more sophisticated ones that could be used to learn to play games, for example. The theory and math builds on each preceding part, so I strongly recommend you follow this series in order even though the first parts are less exciting.
Let’s consider a hypothetical problem where we’re at a casino and in a section with some slot machines. Let’s say we’re at a section with 10 slot machines in a row and it says “Play for free! Max payout is $10!” Wow, not bad right! Let’s say we ask one of the employees what’s going on here, it seems too good to be true, and she says “It’s really true, play as much as you want, it’s free. Each slot machine is gauranteed to give you a reward between 0 and $10. Oh, by the way, keep this on the down low but those 10 slot machines each have a different average payout, so try to figure out which one gives out the most rewards on average and you’ll be making tons of cash!”
What kind of casino is this?! Who knows, but it’s awesome. Oh by the way, here’s a joke: What’s another name for a slot machine? …. A one-armed bandit! Get it? It’s got one arm (a lever) and it generally steals your money! Huh, well I guess we could call our situation a 10-armed bandit problem, or an n-armed bandit problem more generally, where n is the number of slot machines.
Let me restate our problem more formally. We have n possible actions (here n = 10) and at each play (k) of this “game” we can choose a single lever to pull. After taking an action $a$ we will receive a reward $R_k$ (reward at play k). Each lever has a unique probability distribution of payouts (rewards). For example, if we have 10 slot machines, slot machine #3 may give out an average reward of $9 whereas slot machine #1 only gives out an average reward of $4. Of course, since the reward at each play is probabilistic, it is possible that lever #1 will by chance give us a reward of $9 on a single play. But if we play many games, we expect on average slot machine #1 is associated with a lower reward than #3.
Thus in words, our strategy should be to play a few times, choosing different levers and observing our rewards for each action. Then we want to only choose the lever with the largest observed average reward. Thus we need a concept of expected reward for taking an action $a$ based on our previous plays, we’ll call this expected reward $Q_k(a)$ mathematically. $Q_k(a)$ is a function that accepts action $a$ and returns the expected reward for that action. Formally, That is, the expected reward at play k for action $a$ is the arithmetic mean of all the previous rewards we’ve received for taking action a. Thus our previous actions and observations influence our future actions, we might even say some of our previous actions reinforce our current and future actions. We’ll come back to this later.
Some keywords for this problem are exploration and exploitation. Our strategy needs to include some amount of exploitation (simply choosing the best lever based on what we know so far) and some amount of exploration (choosing random levers so we can learn more). The proper balance of exploitation and exploration will be important to maximizing our rewards.
So how can we come up with an algorithm to figure out which slot machine has the largest average payout? Well, the simplest algorithm would be to select action $a$ for which this equation is true: This equation/rule states that the expected reward for the current play k for taking action $A$ is equal to the maximum average reward of all previous actions taken. In other words, we use our above reward function $Q_k(a)$ on all the possible actions and select the one that returns the maximum average reward. Since $Q_k(a)$ depends on a record of our previous actions and their associated rewards, this method will not select actions that we haven’t already explored. Thus we might have previously tried lever 1 and lever 3, and noticed that lever 3 gives us a higher reward, but with this method, we’ll never think to try another lever, say #6, which, unbeknownst to us, actually gives out the highest average reward. This method of simply choosing the best lever that we know of so far is called a “greedy” method.
Obviously, we need to have some exploration of other levers (slot machines) going on to discover the true best action. One simple modification to our above algorithm is to change it to an $\epsilon$ (epsilon)-greedy algorithm, such that, with a probability $\epsilon$, we will choose an action $a$ at random, and the rest of the time (probability $1-\epsilon$) we will choose the best lever based on what we currently know from past plays. So most of the time we play greedy, but sometimes we take some risks and choose a random lever and see what happens. This will of course influence our future greedy actions.
Alright, I think that’s an in-depth enough discussion of the problem and how we want to try to solve it with a rudimentary RL algorithm. Let’s start implementing this with Python.