Roulette Wheel Selection for Multi-Armed Bandit Problems

While I was waiting in an airport to fly to a conference recently, I got the idea of using a technique called roulette wheel selection to solve multi-armed bandit problems. Bottom line: the technique appears to work very well.

Imagine you’re standing in front of three slot machines. You have 40 tokens. Each machine pays either $1 or $0 per play according to a probability distribution that is unknown to you. Your challenge is to use your 40 tokens optimally and find the best machine.

There are many techniques to solve this kind of multi-armed bandit problem: epsilon-first, epsilon-greedy, UCB1, and Thompson sampling (beta distribution) are among the best known.

In pseudo-code, most of these techniques look like:

track number of S = successes (wins),
 and F = failures (losses) for each machine

loop 40 times
  use curr S and F values to pick a good machine
  play the selected machine
  update S or F for the played machine
end-loop

You don’t want to pick the current best machine on each iteration because the machines pay out probabilistically and you may not have correctly identified the true best machine. You need to balance exploring for the best machine with exploiting the current best machine.

So my thought was to use a technique from evolutionary optimization called roulette wheel selection to pick a machine to play. Suppose at some point in time the situation is:

machine 0 : S = 4, F = 2
machine 1 : S = 3, F = 1
machine 2 : S = 2, F = 3

You can use S and F to estimate the probability of each machine winning:

machine 0 : P = 4/6 = 0.67
machine 1 : P = 3/4 = 0.75
machine 2 : P = 2/5 = 0.40

You normalize the probabilities into fitness values that sum to 1.0:

machine 0 : f = 0.67 / (0.67 + 0.75 + 0.40) = 0.37
machine 1 : f = 0.75 / (0.67 + 0.75 + 0.40) = 0.41
machine 2 : f = 0.40 / (0.67 + 0.75 + 0.40) = 0.22

You want to select a machine in a way that is proportional to its fitness value, in other words, machine 1 should have the best chance to be selected, followed by machine 0, then machine 2.

Now the accumulated fitness values and associated index values are:

0.00   0.37   0.78   1.00
     0      1      2

You generate a random value between 0.0 and 1.0 which determines the selected machine. For example, if p = 0.30 (between 0.0 and 0.37), machine 0 is selected. If p = 0.55 (between 0.37 and 0.78), machine 1 is selected and so on.

I coded up a demo and the roulette wheel selection technique seems to work extremely well. The roulette wheel technique is closest to Thompson sampling. Thompson sampling has the drawback that as the values of S and F increase, the likelihood of picking the machine with the best current estimated probability of winning becomes close to 1. In other words, you don’t explore non-current-best machines. The roulette wheel selection technique does not have this disadvantage.

Wow. Maybe I’m smarter than I thought I was. Or maybe not.



Left: Roulette wheels are mechanical works of art. Center: Humphrey Bogart in “Casablanca” (1942). Right: Ursula Andress in “Casino Royale” (1967)

This entry was posted in Machine Learning. Bookmark the permalink.

1 Response to Roulette Wheel Selection for Multi-Armed Bandit Problems

  1. Thorsten Kleppe's avatar Thorsten Kleppe says:

    The roulette wheel selection algorithm sounds absolutely amazing. I have never heard about that, but your explanation was really fantastic.
    The idea feels like a softmax classification to squeeze the space to make decisions what is the best to do. Crazy stuff!

    I tried to code the algorithm, here is my short implementation:

    int S[]={4,3,2}, F[]={2,1,3};
    float machine[3]={};
    float selected = 0.5f, step=0, sum=0;

    for(int i=0;i<3;i++)
    sum += machine[i] = (float)S[i] / (S[i]+F[i]);

    for(int i=0;i=step&&selected<=machine[i] / sum + step)
    Print("machine ",i," selected by ",selected," between ",step," and ",machine[i] / sum + step);

    Hope the code is on a piece and you like it.

    I once had the case that the rand() was too slow.
    On the Intel site they showed a nice fast_rand() function on listing 1, that works around 2 times faster in my tests.
    https://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor

    With a lot of iterations this could be very powerful like this:
    selected = fast_rand() / 32767.0f;

Comments are closed.