Review: “Bandit Algorithms for Website Optimization” by John Myles White

Sergey Andreev
Jul 19, 2021

I came across “Bandit Algorithms for Website Optimization” as a reference in an article posted by Jay Kreps. The book provides a gentle introduction to the space of multi-armed bandit algorithms and their application to website optimization as an alternative to A/B testing. John Myles White provides implementation details for each algorithm described in the book and helps build intuition about how they work and how to apply them successfully. Personally, it was great and refreshing to dig into the details of the algorithms and the math behind them.

Summary

The multi-armed bandit problem is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice’s properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice.
Wikipedia

Multi-armed bandit algorithms can provide an algorithmic solution to the A/B testing problem. In general, the A/B testing technique is used to run a controlled experiment between two options. Users are randomly assigned to one of two groups: Group A and Group B. The experiment runs until there is enough evidence that Option A is more successful than Option B (or vice versa), or until a pre-defined end time. After that, all users are offered the more successful option.

A standard A/B test consists of a short period of pure exploration (both options are shown to users based on their respective group) and a long period of pure exploitation (all users are sent to the winning option). Depending on the number of users, the groups can be split 10/90 or in any other proportion that yields statistically meaningful results.

A few challenges of A/B testing are the sudden jump between exploration and exploitation, and the fact that strikingly inferior options keep being explored for the duration of the experiment. Furthermore, you can only run the experiment between two available options. Bandit algorithms provide solutions to these problems.

The book covers three multi-armed bandit algorithms: epsilon-Greedy, Softmax and UCB1.

epsilon-Greedy

The algorithm generally exploits the option (arm) that has been most successful historically, but every once in a while it explores the other available options (arms). The epsilon parameter controls the odds of exploration vs exploitation: the epsilon-Greedy algorithm exploits the best-known option with probability 1 - epsilon and explores each of the available options with probability epsilon/n, where n is the number of options to choose from.
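As a rough sketch of the idea in Python (the class and method names below are my own choice, not necessarily the book’s exact code), the agent keeps a pull count and a running-average value per arm:

import random

class EpsilonGreedy:
    def __init__(self, epsilon, n_arms):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # how many times each arm was pulled
        self.values = [0.0] * n_arms  # running average reward per arm

    def select_arm(self):
        # Exploit the best-known arm with probability 1 - epsilon,
        # otherwise explore uniformly at random among all arms.
        if random.random() > self.epsilon:
            return self.values.index(max(self.values))
        return random.randrange(len(self.values))

    def update(self, arm, reward):
        # Incrementally update the running average for the chosen arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]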

A few weaknesses of the epsilon-Greedy algorithm:

  • it continues to over-explore over time, even when there is a clear winning arm
  • new options are only tried epsilon percent of the time, so it takes a while to build enough confidence in the winning option
  • it explores options completely at random, without any concern for their merits. If the difference in reward rates between arms is small, you need to explore more often than 10% of the time to correctly determine the winning arm; but if the difference is large, you can explore a lot less than 10% of the time and still pick the winning arm.

Softmax

The Softmax algorithm uses structured exploration. It copes with arms differing in estimated value by explicitly incorporating information about the reward rates of the available arms into its method for choosing which arm to select when it explores. The Softmax algorithm explores by randomly selecting from all of the arms with probabilities that are more-or-less proportional to the estimated value of each arm.

The algorithm will select one of two arms with probabilities computed as follows:

p(rA) = exp(rA / tau) / (exp(rA / tau) + exp(rB / tau))

p(rB) = exp(rB / tau) / (exp(rA / tau) + exp(rB / tau))

where rA is the success rate of arm A, rB is the success rate of arm B, and tau is a configurable parameter called the temperature. After that, the reward for the picked arm is updated as a running average, the same way as it is done for the epsilon-Greedy algorithm.
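A minimal sketch of the selection step, generalizing the two-arm formula above to n arms (the function name is mine, and the counts/values bookkeeping is assumed to be the same as in the epsilon-Greedy sketch):

import math
import random

def softmax_select(values, tau):
    # p(arm i) = exp(values[i] / tau) / sum_j exp(values[j] / tau)
    exps = [math.exp(v / tau) for v in values]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sample one arm according to those probabilities.
    r, cumulative = random.random(), 0.0
    for arm, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return arm
    return len(probs) - 1  # guard against floating-point rounding

A high temperature pushes the probabilities toward uniform (more exploration), while a low temperature concentrates them on the best-looking arm (more exploitation).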

Both algorithms tend to over-explore over time, but there is a way to address that: annealing. It is a technique that dynamically adjusts the amount of time the algorithm spends exploring. This can be done by slowly decreasing epsilon for the epsilon-Greedy algorithm and by slowly decreasing the temperature for the Softmax algorithm. Annealing is a great way to automatically tune the hyperparameter of the Softmax algorithm and remove the non-trivial task of picking the right value for the temperature.
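For example, an annealed temperature schedule might look like the sketch below (the exact schedule here is my own assumption; the point is only that the temperature shrinks as the total number of pulls grows):

import math

def annealed_temperature(total_pulls):
    # Starts very high (near-uniform exploration) and decays toward zero,
    # so the Softmax gradually shifts from exploration to exploitation.
    return 1.0 / math.log(total_pulls + 1.0000001)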

Upper Confidence Bound (UCB1)

Both algorithms mentioned above have a major weakness: they don’t keep track of how much they know about any of the arms available to them. They under-explore options whose initial experiences were not rewarding, even though they don’t have enough data to be confident about those arms. They are easily misled by a few negative experiences. However, because of their use of randomness, they can make up for this later.

The big improvement of the UCB1 algorithm is that it doesn’t use randomness and it doesn’t have any parameters to configure. UCB1 tries every arm at least once. Then it maintains a confidence interval that has a reasonable chance of containing the true value of each arm. It replaces each arm’s estimated value with the upper bound of that confidence interval and picks the arm with the best resulting estimate. At the beginning of the experiment, UCB1 produces a lot of noise as it tries to learn about all the arms, but it smooths out over time and converges to the optimal arm.
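A sketch of the UCB1 selection rule, using the standard sqrt(2 * ln(total) / count) exploration bonus (the variable names are mine):

import math

def ucb1_select(counts, values):
    # First make sure every arm has been tried at least once.
    for arm, count in enumerate(counts):
        if count == 0:
            return arm
    total = sum(counts)
    # Add an exploration bonus that shrinks as an arm is pulled more often,
    # then pick the arm with the highest upper confidence bound.
    bounds = [value + math.sqrt(2 * math.log(total) / count)
              for value, count in zip(values, counts)]
    return bounds.index(max(bounds))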

One of the challenges of bandit algorithms is that they are hard to debug, because they rely on active learning, unlike standard machine learning algorithms. Therefore, the author presents a simple framework for testing multi-armed bandit algorithms based on Monte Carlo simulation and provides tips for building intuition about the algorithms. The test framework lets you experiment with the algorithms without running them in a live environment.
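A simplified version of such a framework might look like the sketch below, with simulated Bernoulli arms and a single run over a fixed horizon (the select_arm/update interface matches the epsilon-Greedy sketch above; in practice you would average over many independent runs):

import random

class BernoulliArm:
    # Simulated arm that pays 1.0 with a fixed (hidden) probability p.
    def __init__(self, p):
        self.p = p

    def draw(self):
        return 1.0 if random.random() < self.p else 0.0

def simulate(algo, arms, horizon):
    # At each step the algorithm picks an arm, observes a simulated
    # reward and updates its estimates.
    chosen, rewards = [], []
    for _ in range(horizon):
        arm = algo.select_arm()
        reward = arms[arm].draw()
        algo.update(arm, reward)
        chosen.append(arm)
        rewards.append(reward)
    return chosen, rewards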

The book also shares different ways to analyze the results of the simulation in order to assess the performance of an algorithm (a short sketch of these metrics follows the list):

  • track the probability of choosing the best arm — it is the simplest way to analyze the performance however it strongly depends on the length of the horizon and it is too focused on the performance of the algorithm at each fixed point in time.
  • track the average reward at each point in time — it helps when there are many arms similar to the best but it is still too focused on the performance at each fixed point in time.
  • track the cumulative reward at each point in time — it focuses on the cumulative performance over time and helps to decide if the increased exploration is worth the trouble.
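Given the output of the simulate sketch above (one chosen/rewards pair per run), these metrics reduce to a few lines of bookkeeping, for example:

def best_arm_frequency(all_chosen, best_arm):
    # Fraction of simulation runs that picked the best arm at each time step.
    runs = len(all_chosen)
    horizon = len(all_chosen[0])
    return [sum(chosen[t] == best_arm for chosen in all_chosen) / runs
            for t in range(horizon)]

def cumulative_reward(rewards):
    # Running total of the rewards collected over the horizon of one run.
    totals, running = [], 0.0
    for r in rewards:
        running += r
        totals.append(running)
    return totals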

At the end of the book, the author presents the challenges of using multi-armed bandit algorithms in live environments. For example, consider a reward function that represents a click that results in a purchase: it is hard to use because you don’t know the reward at the time the option is displayed to the user, so you need to figure out a way to properly incorporate it into the algorithm.

Verdict

Recommended (even for non-technical people)

