The Sample Complexity of Exploration in the Multi-Armed Bandit Problem

Shie Mannor, John N. Tsitsiklis; 5(Jun):623--648, 2004.


We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given n arms, a total of O((n2)log(1/δ)) trials suffices in order to find an ε-optimal arm with probability at least 1-δ. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound, and show an explicit dependence on the (unknown) statistics of the arms. We also provide a similar bound within a Bayesian setting. The case where the statistics of the arms are known but the identities of the arms are not, is also discussed. For this case, we provide a lower bound of Θ((1/ε2)(n+log(1/δ))) on the expected number of trials, as well as a sampling policy with a matching upper bound. If instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish a matching upper and lower bound of the form Θ((n2)log(1/δ)). Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.