Multi-Armed Bandit Using Thompson Sampling With C#

Imagine you are standing in front of three slot machines. Each machine accepts a token and pays out either 0 dollars or 1 dollar, with some unknown probability. The probability of win/success might be different for each of the three machines. You have 120 tokens. How do you maximize your profit?

One approach is to use 120 / 3 = 40 tokens on each machine and hope for the best. Another approach is called epsilon-first: use 10 tokens on each machine, determine which machine is best based on that info, and use the remaining 90 tokens on what you believe is the best machine.

A common approach is epsilon-greedy. You start by using 10 tokens on each machine. Suppose the probabilities of winning at that point are machine A = 0.50, machine B = 0.60, machine C = 0.40. For your next pull, you pick a random machine with some small (epsilon) probability, say 0.05, or the current best machine with the remaining probability 0.95. After each pull you update the probabilities of winning and continue picking a random machine or the current best. The main problem with epsilon-greedy is that there’s no good way to determine epsilon.

An improved version of epsilon-greedy is called Thompson Sampling. It is based on the mathematical Beta distribution. For example, Beta(a=3, b=1) will generate a probability between 0 and 1 but if you select many probability values, on average the probability will be a / (a+b) = 3/4 = 0.75.

In Thompson Sampling, the a and b values represent number of successes and failures at any point in time. Suppose you start by using 10 of your 120 tokens on each machine and get:

machine A: 5 successes, 5 failures
machine B: 6 successes, 4 failures
machine C: 3 successes, 7 failures

If you apply the Beta distribution, the three probabilities of picking a machine will be between 0 and 1 but will tend to be:

machine A: 5 / (5 + 5) = 5/10 = 0.50
machine B: 6 / (6 + 4) = 6/10 = 0.60
machine C: 3 / (3 + 7) = 3/10 = 0.30

But you could get any probabilities. The idea is that you will most likely pick the current best machine, but every machine still has a chance to be picked. This balances exploitation (picking the best) with exploration (getting additional information).

Implementing Thompson Sampling is very simple, assuming you can generate a Beta distribution. The tricky part is getting started. The Beta(a, b) requires both a and b to be at least 1. So you can either seed all the success and failure values to dummy 1 values, or you can compute Beta(a+1, b+1) for each selection.

In my demo code, to generate Beta values I use what’s called the BA algorithm, by R.C.H. Cheng, in “Generating Beta Variates with Nonintegral Shape Parameters”, Communications of the ACM, April 1978, vol 21, No 4, pp 317-322.

By the way, note that Thompson Sampling is intended to maximize profit. It does not necessarily find the true probabilities of each machine, which is a subtlety different problem.

In my demo, I set up three machines with probabilities of winning = 0.3, 0.7, 0.5 and 10 tokens. The final results were:

Final Success vector:    1    2    1
Final Failure vector:    3    1    2

This means machine A won 1 time and lost 3 times for prob = 1/4 = 0.25, machine B won with prob 2/3 = 0.67, and machine C won with prob 1/3 = 0.33 however the amount won was only 1 + 2 + 1 = 4 dollars. If you knew the best machine was B, you could use all 10 tokens on it and win, on average, 10 * 0.70 = 7.0 dollars.

A one-armed bandit is a slang term for a gambling slot machine. Slots machines are the least elegant form of casino games.

Top: James Bond (actor Daniel Craig) faces the evil Le Chiffre, a financier to terrorists, at a high stakes Monte Negro poker table in “Casino Royale” (2006).

Middle: Bond (actor Sean Connery) faces the evil Emilio Largo, an international nuclear blackmailer, at a chemin de fer (baccarat) table in Nassau, in “Thunderball” (1965).

Bottom: Bond in his very first appearance in the first few minutes of “Dr. No” (1962) is at the fictional Le Cercle casino in London. His face is hidden from camera until he wins against Sylvia Trench and introduces himself with his famous “Bond, James Bond” line. The casino is actually the Les Ambassadeurs Club in Hyde Park, which is connected to the InterContinental Hotel. I stayed there recently and got to see the casino. It was very exciting!

Demo code. Replace “lt”, “gt”, “lte”, “gte” with Boolean operator symbols.

using System;

namespace BanditThompson
{
  internal class BanditThompsonProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin Bandit Thompson" +
        " sampling demo");
      Console.WriteLine("Goal is to maximize payout" +
        " from three machines");
      Console.WriteLine("Machines pay out with" +
        " probs 0.3, 0.7, 0.5");

      BetaSampler bs = new BetaSampler(1);

      int N = 3;  // number machines
      double[] means = new double[] { 0.3, 0.7, 0.5 };
      double[] probs = new double[N];  // all 0.0

      int[] S = new int[N]; // successes
      int[] F = new int[N]; // failures
      Random rnd = new Random(17);  // win/lose 4 14

      for (int trial = 0; trial "lt" 10; ++trial)
      {
        Console.WriteLine("\nTrial " + trial);

        for (int i = 0; i "lt" N; ++i)  // 3 probs
          probs[i] = bs.NextSample(S[i] + 1, F[i] + 1);

        Console.Write("sampling probs =  ");
        for (int i = 0; i "lt" N; ++i)
          Console.Write(probs[i].ToString("F4") + " ");
        Console.WriteLine("");

        int machine = ArgMax(probs);
        Console.Write("Playing machine " + machine);

        double p = rnd.NextDouble();  // [0, 1]
        if (p "lt" means[machine])
        {
          Console.WriteLine(" -- win");
          S[machine] += 1;  // or ++
        }
        else
        {
          Console.WriteLine(" -- lose");
          F[machine] += 1;
        }

      } // trial

      Console.Write("\nFinal Success vector: ");
      VecShow(S, 4, true);
      Console.Write("Final Failure vector: ");
      VecShow(F, 4, true);

      Console.WriteLine("\nEnd ");
      Console.ReadLine();
    } // Main

    public static int ArgMax(double[] vec)
    {
      int maxIdx = 0;
      double maxVal = vec[0];
      for (int i = 0; i "lt" vec.Length; ++i)
      {
        if (vec[i] "gt" maxVal)
        {
          maxVal = vec[i];
          maxIdx = i;
        }
      }
      return maxIdx;
    }

    public static void VecShow(int[] vec, int wid, bool nl)
    {
      for (int i = 0; i "lt" vec.Length; ++i)
        Console.Write(vec[i].ToString().PadLeft(wid) + " ");
      if (nl == true)
        Console.WriteLine("");
    }

  } // Program

  public class BetaSampler
  {
    public Random rnd;

    public BetaSampler(int seed)
    {
      this.rnd = new Random(seed);
    }

    public double NextSample(double a, double b)
    {
      double alpha = a + b;
      double beta = 0.0;
      double u1 = 0.0; double u2 = 0.0;
      double w = 0.0; double v = 0.0;
      if (Math.Min(a, b) "lte" 1.0)
        beta = Math.Max(1.0 / a, 1.0 / b);
      else
        beta = 
          Math.Sqrt((alpha - 2.0) / (2 * a * b - alpha));

      double gamma = a + (1.0 / beta);
      int sanityCt = 0;
      int maxSanity = 100000;
      while (sanityCt "lt" maxSanity)
      {
        u1 = this.rnd.NextDouble();
        u2 = this.rnd.NextDouble();
        v = beta * Math.Log(u1 / (1 - u1));
        w = a * Math.Exp(v);
        double tmp = Math.Log(alpha / (b + w));

        if (alpha * tmp + (gamma * v) - 1.3862944 "gte" 
          Math.Log(u1 * u1 * u2))
          break;
        ++sanityCt;
      }
      if (sanityCt "gte" maxSanity)
        throw new Exception("Fail");
      double x = w / (b + w);
      return x;
    }
  } // BetaSampler

} // ns