Adaptive Algorithms for Infinitely Many-Armed Bandits

JMLR · 2026 · Emmanuel Pilliat

Recommending a near-best option when there are far more candidates than the sampling budget.

In the infinitely many-armed bandit problem, the pool of options (arms) is much larger than the number of pulls available — too large to try even once. The objective is therefore not to minimise regret but to recommend an arm whose mean reward ranks near the top of the pool. This paper introduces two distribution-free algorithms, OSE and PROSE, with near-optimal guarantees. The visualisation below runs them against the BSH baseline on a shared reservoir of arms; each panel reports the true rank of its current recommendation (1 = best arm).

speed 5 reservoir shape α 0.75 noise 0.50 arms 160 pull 0 / 1500

The reservoir — the quantile curve λ. Each algorithm's current pull ● , its noisy reward ✕ and its recommendation ★ ride along it.

BSHbracketed sequential halving · Zhao et al. 2023rank —

Brackets + sequential halving: pull the survivors, drop the worse half each round.

OSEoptimistic scope exploration · this paperrank —

Pull the highest-UCB arm within the exploration scope; recommend the highest-LCB arm.

PROSEprogressive ranking for OSE · this paperrank —

Rank arms by LCB; pull the highest-UCB within the top scope; recommend the highest-LCB arm.

Cumulative reward — total collected so far; the curve is the reward per pull (how well each algorithm exploits over time).

Each vertical whisker is an arm's confidence interval; the dot is its empirical mean, the faint tick its true mean. ▮ shaded = the current exploration scope · ▮ purple = the arm pulled this step · ★ gold = the recommended arm.

How the three play it

BSH (Zhao et al., 2023) — opens progressively larger brackets of arms and runs sequential halving inside each, repeatedly discarding the weaker half. Robust, but spends much of its budget re-confirming.
OSE — Optimistic Scope Exploration, the paper’s main algorithm. It keeps a growing scope of arms; each step it pulls the most optimistic arm (highest upper bound) in the scope and recommends the safest (highest lower bound) seen so far. Distribution-free, with near-optimal guarantees and no knowledge of the reservoir.
PROSE — Progressive Ranking for OSE, the practical version: it keeps arms ranked by their lower bound so the scope always holds the current best candidates. Same idea, state-of-the-art empirically — and, unlike BSH, it matches the long-run behaviour of classical UCB.

References & links

📄 arXiv : 2510.27319 Full paper — model, the OSE / PROSE algorithms, rank-corrected sample complexity, and transition regimes. Read on arXiv → 🎞️ Slides Talk deck: the reservoir, the main question, OSE in action, and the theoretical results. Open the talk → ⬇️ PDF Download the manuscript. Download →