Several months ago I coded up a Python language example of the Thompson Sampling algorithm for the multi-armed bandit problem. I ran across my old demo code, saw a few things I didn't like, and decided to refactor it.
Imagine you are facing three slot machines. You have 8 tokens. The machines pay $1 if you win, and $0 if you lose. Each machine pays according to a different probability distribution that is unknown to you. You want to use a technique that gives you a good chance of winning the most money for your 8 pulls.
There are many algorithms for the multi-armed bandit problem. The four most common are epsilon-first, epsilon-greedy, UCB1 (upper confidence bound), and Thompson Sampling.
Thompson Sampling is very clever. You keep track of how many times each machine has won and lost, which gives you empirical probabilities of winning. But instead of always picking the machine with the best current empirical probability, you draw a random sample for each machine from a Beta distribution parameterized by that machine's win and loss counts, and pull the machine with the largest sample. The samples usually mirror the empirical probabilities, but every now and then a different machine gets picked. This lets the algorithm explore other arms rather than getting stuck repeatedly pulling a single arm.
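The idea can be sketched in a few lines of plain Python. This is not my original demo code, just a minimal sketch: the true win probabilities (0.3, 0.5, 0.7) are made-up values for illustration, and the +1 in each Beta parameter corresponds to a uniform Beta(1,1) prior.

```python
import random

def thompson_sampling(true_probs, n_pulls, seed=0):
    # Simulate Thompson Sampling on Bernoulli bandit arms.
    # true_probs are hypothetical win probabilities, unknown to the algorithm.
    rng = random.Random(seed)
    n_arms = len(true_probs)
    wins = [0] * n_arms    # observed wins per machine
    losses = [0] * n_arms  # observed losses per machine
    total_reward = 0
    for _ in range(n_pulls):
        # Draw one sample per machine from Beta(wins+1, losses+1)
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1)
                   for i in range(n_arms)]
        arm = samples.index(max(samples))  # pull the machine with the largest sample
        reward = 1 if rng.random() < true_probs[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        total_reward += reward
    return total_reward, wins, losses

# Three machines with hypothetical payout probabilities, 8 tokens
total, wins, losses = thompson_sampling([0.3, 0.5, 0.7], n_pulls=8)
```

With only 8 pulls, luck dominates; over hundreds of pulls the algorithm concentrates its pulls on the best arm.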
I’ll write up a full description when I get some free time.

Illustrations by artists named Thompson. From left to right: Dave, Earl, Mike, Steve, Kelly.
