Reinforcement Learning - Monte Carlo Methods

Playing Blackjack with Monte Carlo Methods

Introduction

In part 1, we considered a very simple problem, the n-armed bandit problem, and devised a correspondingly simple algorithm to solve it ($\epsilon$-greedy action selection). That problem has only a single state: a choice among 10 actions, each with a stationary reward distribution. Let’s up the ante a bit and consider a more interesting problem with multiple (yet finitely many) states: the card game blackjack (a.k.a. 21). Hunker down, this is a long one.

Rules and gameplay of blackjack (check out https://www.youtube.com/watch?v=qd5oc9hLrXg if necessary):

  1. There is a dealer and 1 or more players that independently play against the dealer.

  2. Each player is dealt 2 cards face-up. The dealer is dealt two cards, one face-up and one face-down.

  3. The goal is to get the sum of your cards’ values as close to 21 as possible without going over.

  4. After the initial cards are dealt, each player can choose to ‘stay’ or ‘hit’ (ask for another card).

  5. The dealer always follows this policy: hit until cards sum to 17 or more, then stay.

  6. If the dealer is closer to 21, the dealer wins and the player loses, and vice versa (a minimal simulation of this game flow is sketched right after this list).
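To make the flow of a hand concrete, below is a minimal sketch of a single round under these rules. It assumes an infinite deck (cards drawn with replacement) and treats tied totals as a draw; `draw_card`, `hand_value`, and `play_hand` are hypothetical helpers written for this sketch, and the player’s policy is passed in as a function of the player’s sum and the dealer’s face-up card.

```python
import random

def draw_card():
    # Infinite-deck assumption: draw with replacement. Ranks 1 (ace) through
    # 13 are collapsed to values 1-10, so a 10 is four times as likely as any
    # other value (10, jack, queen, king all count as 10).
    return min(random.randint(1, 13), 10)

def hand_value(cards):
    # Count one ace as 11 (a "usable" ace) whenever that does not bust the hand.
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def play_hand(player_policy):
    player = [draw_card(), draw_card()]
    dealer = [draw_card(), draw_card()]   # dealer[0] is the face-up card

    # Player acts first, using whatever policy function is supplied.
    while player_policy(hand_value(player), dealer[0]) == 'hit':
        player.append(draw_card())
        if hand_value(player) > 21:
            return -1                     # player goes over 21 and loses

    # Dealer follows the fixed rule: hit until the cards sum to 17 or more.
    while hand_value(dealer) < 17:
        dealer.append(draw_card())
    if hand_value(dealer) > 21:
        return +1                         # dealer goes over 21, player wins

    # Otherwise whoever is closer to 21 wins; equal totals count as a draw here.
    diff = hand_value(player) - hand_value(dealer)
    return 0 if diff == 0 else (1 if diff > 0 else -1)

# Example: a player that simply mimics the dealer's own rule.
mimic_dealer = lambda player_sum, dealer_showing: 'hit' if player_sum < 17 else 'stay'
print(play_hand(mimic_dealer))            # prints -1, 0, or +1
```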

So what’s the state space for this problem? It’s much larger than the single state of the n-armed bandit. In reinforcement learning, a state is all the information available to the agent (the decision maker) at a particular time $t$. The n-armed bandit has just one state because the agent faces the same 10 actions at every time step: no new information arrives and the actions never change.

So what are all the possible combinations of information available to the agent (the player) in blackjack? The player starts with two cards, so there is every possible two-card combination; in addition, the player can see one of the dealer’s two cards. That adds up to a fair number of possible states (around 200 once hands are summarized appropriately). As with any RL problem, our ultimate goal is to find the policy that maximizes our reward.
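Where does a number like 200 come from? A common way to summarize the player’s information (an assumption of this sketch, following the standard treatment of this problem) is the player’s current sum (12 through 21, since below 12 you would always hit: no single card can push you past 21), the dealer’s face-up card (ace through 10), and whether the player holds a usable ace (an ace currently counted as 11). Counting those combinations is easy:

```python
from itertools import product

# Player's sum runs 12-21, the dealer shows ace through 10, and the player
# either holds a usable ace or not.
states = list(product(range(12, 22),    # player's current sum
                      range(1, 11),     # dealer's face-up card (1 = ace)
                      (True, False)))   # usable ace?
print(len(states))                      # 200
```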

A policy is roughly equivalent to a strategy. Some reinforcement learning methods essentially rely on brute force: enumerate every state-action pair (every possible action in every state) along with the rewards received, and read off an optimal policy. For most of the problems we care about, though, the state-action space is far too large for brute-force methods to be computationally feasible. Instead we must rely on experience, i.e. playing the game, trying out various actions, and learning which ones yield the greatest returns; we need to devise an algorithm that captures this experiential learning process.
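As a small taste of what that looks like, here is a minimal sketch of evaluating one fixed policy purely from experience, reusing the hypothetical `play_hand` helper sketched above; the particular policy (stick only on 20 or 21) and the number of hands are illustrative choices.

```python
# Illustrative fixed policy: stick only on 20 or 21, hit otherwise.
def stick_on_20(player_sum, dealer_showing):
    return 'stay' if player_sum >= 20 else 'hit'

# "Experience" is just many completed hands; the average reward observed is a
# Monte Carlo estimate of how good (or bad) this policy is overall.
rewards = [play_hand(stick_on_20) for _ in range(100_000)]
print(sum(rewards) / len(rewards))
```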