Adaptive Algorithms for Infinitely Many-Armed Bandits

Introduction

This Talk

  • Sequential decision problems
  • Unbounded set of options
  • Limited sampling budget

Classical Bandit Problems

Consider a two-armed bandit problem, \(\mathcal A=\{1, 2\}\)

We observe:

\(X_{1,s} = \mu_{1} + \varepsilon_{1,s} \quad \text{and} \quad X_{2,s} = \mu_{2} + \varepsilon_{2,s}\)

  • \(\mu_1\), \(\mu_2\) belong to \(\mathbb R\)
  • \(\varepsilon_{i,s} \sim \mathcal N(0,1)\) (or 1-subgaussian)

Active Learning Setting

At each timestep \(t\):

  • Based on past observations
  • The learner pulls arm \(\hat a_t \in \{1, 2\}\) and gets reward \(X_{\hat a_t, s}\)
  • \(s = N_{\hat a_t,t}\) is the number of times arm \(\hat a_t\) has been pulled so far

Cumulative Regret

Motivation:

  • Overall cost of learning
  • Every suboptimal pull incurs a loss

Defined as

\(R = \sum_{t=1}^n (\mu_1 - \mu_{\hat a_t})\) (assuming arm 1 is optimal)

The UCB Algorithm

Optimism in the face of uncertainty:

Choose the arm that maximizes the upper confidence bound

\(UCB_{a,t} \asymp \hat \mu_a + \frac{c}{\sqrt{N_{a,t}}}\)
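As a concrete illustration, here is a minimal Python sketch of this rule for finitely many Gaussian arms; the exploration constant `c` and the arm means are illustrative choices, not values from the talk.

```python
import numpy as np

def ucb(mus, n_rounds, c=2.0, rng=None):
    """Minimal UCB sketch: pull the arm with the largest empirical mean + c / sqrt(pulls)."""
    rng = np.random.default_rng(rng)
    k = len(mus)
    counts = np.zeros(k)                     # N_{a,t}: number of pulls of each arm
    sums = np.zeros(k)                       # running sum of rewards per arm
    pulled = np.empty(n_rounds, dtype=int)
    for t in range(n_rounds):
        if t < k:                            # pull each arm once to initialize
            a = t
        else:
            means = sums / counts
            a = int(np.argmax(means + c / np.sqrt(counts)))
        x = mus[a] + rng.standard_normal()   # reward X_{a,s} = mu_a + N(0,1) noise
        counts[a] += 1
        sums[a] += x
        pulled[t] = a
    return pulled

# Two-armed example: cumulative regret = sum of the gaps of the pulled arms
mus = np.array([0.5, 0.0])
pulls = ucb(mus, n_rounds=10_000)
regret = np.sum(mus.max() - mus[pulls])
```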

Illustration of UCB

Upper Bound on Cumulative Regret

Assume \(\mu_1 > \mu_2\) and let \(\Delta = \mu_1 - \mu_2\) be the gap.

Main idea. Once the confidence intervals are disjoint, we detect the best arm:

Simplified condition.

\(\frac{1}{\sqrt{t}} \lesssim \Delta\)

We detect the best arm after time \(t \gtrsim 1/\Delta^2\); until then, each suboptimal pull costs at most \(\Delta\), so

\(R \lesssim \min\left(\frac{\Delta}{\Delta^2}, \Delta t\right) = \min\left(\frac{1}{\Delta}, \Delta t\right) \lesssim \sqrt{t}\)

(worst case when \(\Delta \asymp 1/\sqrt{t}\))

Bandits with Infinitely Many Arms

Setting

Let \(\mathcal A = \{1, 2, \dots\}\).

Observations for arm \(a\) (if ever observed):

\(X_{a,s} = \mu_a + \varepsilon_{a,s}\)

  • \(\mu_a\) are drawn independently from some distribution \(\mathcal D\)
  • \(\varepsilon_{a,s}\) are independent \(\mathcal N(0,1)\)

Alternative formulation

\(X_{a,s} = \lambda_{\gamma(a)} + \varepsilon_{a,s}\)

  • \(\eta \mapsto \lambda_{\eta}\) is a decreasing function
  • \(\gamma(a)\) are iid \(\mathcal U(0,1)\)

\(\rightarrow\) \(\gamma(a)\) represents a continuous rank of arm \(a\)

\(\rightarrow\) “best arm” corresponds to \(\gamma(a)=0\)

\(\rightarrow\) Unbounded distribution of arm means iff \(\lambda_0 = +\infty\)
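For intuition, here is a minimal sketch of this formulation in Python: arms are materialized lazily by drawing their rank \(\gamma(a)\), and the Pareto-type quantile function is just an illustrative (unbounded) choice.

```python
import numpy as np

def make_arm(quantile, rng):
    """Draw the continuous rank gamma(a) ~ U(0,1) and return (rank, mean lambda_{gamma(a)})."""
    gamma = rng.uniform()
    return gamma, quantile(gamma)

def pull(mean, rng):
    """One observation X_{a,s} = lambda_{gamma(a)} + epsilon, with N(0,1) noise."""
    return mean + rng.standard_normal()

# Illustrative unbounded case: lambda_eta = eta^{-alpha}, so lambda_0 = +infinity
alpha = 0.3
quantile = lambda eta: eta ** (-alpha)

rng = np.random.default_rng(0)
arms = [make_arm(quantile, rng) for _ in range(10)]   # only materialize arms we actually touch
rewards = [pull(mean, rng) for _, mean in arms]
```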

When Cumulative Regret Fails

Consider the case where \(\lambda_{\gamma(a)} \in \{0, \epsilon\}\) and

\(\mathbb P(\lambda_{\gamma(a)} = \epsilon) = \eta_0\):

\(\lambda_{\eta}= \begin{cases} \epsilon &\text{if } \eta \leq \eta_0\\ 0 &\text{otherwise } \\ \end{cases}\)

De Heide et al. (2021) established that it is impossible to adapt to \(\eta_0\) and still obtain a “good” cumulative regret

It is, however, possible to adapt to \(\eta_0\) if we consider simple regret

Simple Recommendation Reward

At each timestep \(t\), the learner:

  • Pulls an arm \(\hat a_t \in \mathcal A\)
  • Recommends an arm \(\hat r_t\)

The expected simple reward is \(\lambda_{\gamma(\hat r_t)}\)

\(\epsilon\)-Good Arms and Simple Regrets

Two main criteria in the literature:

  • \(\epsilon\)-good arm: \(\mathbb P(\lambda_{\gamma(\hat r_t)} \geq \lambda_0 - \epsilon) \geq 1-\delta\)
  • Simple regret: \(\lambda_0 - \lambda_{\hat r_t}\)

\(\to\) These are anytime criteria.

\(\to\) In fixed-budget settings \(t \in \{1, \dots, T\}\), simple regret can be defined as \(\lambda_0 - \lambda_{\gamma(\hat r_T)}\)

\(\to\) We need boundedness (\(\lambda_0 < +\infty\)) to define them!

Main Question

Given quantile function \(\lambda\),

What simple reward can we achieve?

Theoretical Results

Main Result

Let \(\mathcal D\) be any distribution (potentially unbounded).

Define for \(0< \rho < \nu < 1\):

\(G(\rho, \nu) = \frac{\nu}{\rho} \frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2} \lor \frac{1}{\rho}\)

  • \(\frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2}\) is the inverse squared gap between ranks \(\rho\) and \(\nu\)
  • It would correspond to the sample complexity if we had only two arms, with means \(\lambda_\rho\) and \(\lambda_{\nu}\)
  • For one arm s.t. \(\gamma(a) \leq \rho\), there are on average \(\nu/\rho\) arms s.t. \(\gamma(a) \leq \nu\)
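A small helper computing \(G(\rho, \nu)\) for a generic quantile function may help fix ideas; the Beta-type \(\lambda_\eta\) below is only an example.

```python
import numpy as np

def G(rho, nu, quantile):
    """G(rho, nu) = (nu/rho) / (lambda_rho - lambda_nu)^2, or 1/rho, whichever is larger."""
    gap = quantile(rho) - quantile(nu)       # lambda is decreasing, so the gap is positive
    return max((nu / rho) / gap**2, 1.0 / rho)

# Illustrative Beta-type quantile function: lambda_eta = 1 - eta^alpha
alpha = 0.5
quantile = lambda eta: 1.0 - eta**alpha

print(G(0.01, 0.1, quantile))   # sample-complexity proxy for separating ranks 0.01 and 0.1
```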

Main Result

\(G(\rho, \nu) = \frac{\nu}{\rho} \frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2} \lor \frac{1}{\rho}\)

We are able to detect quantile \(\eta\) at time \(t\) if

\(\exists \rho < \eta\) s.t. \(\forall \nu \geq \eta\),

\(G(\rho, \nu) \ll t\)

(\(\ll\) hides a polylog factor in \(t\) and error probability \(\delta\))

This result is also valid for unbounded distributions!

Consequences

  • \(\zeta\): noise level
  • \(\mathbf{t}=t \cdot \mathrm{polylog}\)

| Type of the distribution \(\mathcal{D}\) | Quantile function \(\lambda_{\eta}\) | Upper bound on \(\gamma(\hat r_t)\) | Lower bound on \(\lambda_{\gamma(\hat r_t)}\) |
|---|---|---|---|
| Bernoulli | \(u\mathbf{1}\{\eta \leq \eta_0\}\) | \(\eta_0 \mathbf{1}\{\mathbf{t} \geq \frac{\zeta^2}{\eta_0 u^2}\}\) | \(u \mathbf{1}\{\mathbf{t} \geq \frac{\zeta^2}{\eta_0 u^2}\}\) |
| Beta (\(\alpha < 1/2\)) | \(1-\eta^{\alpha}\) | \(\frac{1 \lor \frac{\zeta^2}{\alpha^2}}{\mathbf{t}}\) | \(1-\left(\frac{1 \lor \frac{\zeta^2}{\alpha^2}}{\mathbf{t}}\right)^\alpha\) |
| Beta (\(\alpha \geq 1/2\)) | \(1-\eta^{\alpha}\) | \(\frac{1}{\mathbf{t}} \lor \left(\frac{\zeta^2}{\mathbf{t}}\right)^{\frac{1}{2\alpha}}\) | \(1-\frac{1}{\mathbf{t}^\alpha} \lor \sqrt{\frac{\zeta^2}{\mathbf{t}}}\) |
| Pareto (\(\alpha < 1/2\)) | \(\eta^{-\alpha}\) | \(\frac{1}{\mathbf{t}} \lor \left( \frac{\zeta^2}{\alpha^2\mathbf{t}}\right)^{\frac{1}{1-2\alpha}}\) | \(\mathbf{t}^\alpha \land \left(\frac{\alpha^2\mathbf{t}}{\zeta^2}\right)^{\frac{\alpha}{1-2\alpha}}\) |
| Pareto (\(\alpha \geq 1/2\)) | \(\eta^{-\alpha}\) | \(\frac{1}{\mathbf{t}}\lor \mathbf{1}\{\mathbf{t} \geq \zeta^{1/\alpha}\}\) | \(\mathbf{t}^\alpha \mathbf{1}\{\mathbf{t} \geq \zeta^{1/\alpha}\}\) |

Algorithms

Literature

  • Carpentier & Valko (2015): assume that \(\lambda_{\eta} = 1- \eta^{\alpha}\) and estimate \(\alpha\) to adapt
  • Jamieson et al. (2013): bracketing trick, which explores increasing, disjoint sets of arms
  • Zhao et al. (2023): sequential halving combined with a doubling trick and the bracketing trick

OSE and PROSE

  • OSE (Optimistic Scope Exploration)
  • PROSE (Progressive Ranking for Optimistic Scope Exploration)

OSE

\(LCB_{a,t} = \hat \lambda_{\gamma(a)} - \frac{C}{\sqrt{N_{a,t}}} \quad \text{and} \quad UCB_{a,t} = \hat \lambda_{\gamma(a)} + \frac{C}{\sqrt{N_{a,t}}}\)

At each time \(t\):

  • Draw an exploration scope \(Z \sim t^U\), where \(U \sim \mathcal U(0,1)\)
  • Pull UCB over \(\{1, \dots, Z\}\), \(\hat a_t = \mathrm{argmax}_{a \leq Z} UCB_{a,t}\)
  • Recommend LCB over all arms, \(\hat r_t = \mathrm{argmax}_{a \in \mathcal A} LCB_{a,t}\)

There is also a deterministic version (better in practice and computationally more efficient)
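A minimal sketch of the randomized OSE loop above, assuming Gaussian noise, lazy arm materialization, and an illustrative confidence constant C; it is not the exact pseudocode analyzed in the talk.

```python
import numpy as np

def ose(quantile, n_rounds, C=2.0, seed=0):
    """Randomized OSE sketch: UCB pull inside a random scope, LCB recommendation over all arms."""
    rng = np.random.default_rng(seed)
    gammas, counts, sums = [], [], []            # arms are materialized lazily

    def ensure_arms(z):
        while len(gammas) < z:                   # draw gamma(a) ~ U(0,1) for newly needed arms
            gammas.append(rng.uniform())
            counts.append(0)
            sums.append(0.0)

    recommendations = []
    for t in range(1, n_rounds + 1):
        z = max(1, int(t ** rng.uniform()))      # exploration scope Z ~ t^U, U ~ U(0,1)
        ensure_arms(z)
        n = np.array(counts, dtype=float)
        mean = np.array(sums) / np.maximum(n, 1.0)
        ucb = np.where(n > 0, mean + C / np.sqrt(np.maximum(n, 1.0)), np.inf)
        a = int(np.argmax(ucb[:z]))              # pull the UCB maximizer among the first Z arms
        x = quantile(gammas[a]) + rng.standard_normal()
        counts[a] += 1
        sums[a] += x
        n = np.array(counts, dtype=float)
        mean = np.array(sums) / np.maximum(n, 1.0)
        lcb = np.where(n > 0, mean - C / np.sqrt(np.maximum(n, 1.0)), -np.inf)
        recommendations.append(int(np.argmax(lcb)))  # recommend the LCB maximizer over all arms
    return gammas, recommendations

# Example: Beta-type quantile function lambda_eta = 1 - sqrt(eta)
gammas, recs = ose(lambda eta: 1.0 - np.sqrt(eta), n_rounds=2000)
print(gammas[recs[-1]])   # rank gamma(r_t) of the final recommendation
```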

Why do we take \(Z \sim t^U\)?

  • Assume that for any \(q \in (0,1), \lambda_{q} \asymp \lambda_{q/2}\)
  • If \(Z \asymp t^{k/\log_2(t)}=2^k\),

\(\exists a \leq Z\) s.t. \(\gamma(a) \lesssim 2^{-k}\)

  • This happens with non-negligible probability: since \(U \sim \mathcal U(0,1)\),

\(\mathbb P\left(t^{k/\log_2(t)} \leq Z \leq t^{(k+1)/\log_2(t)}\right) = \mathbb P\left(\tfrac{k}{\log_2(t)} \leq U \leq \tfrac{k+1}{\log_2(t)}\right) = \frac{1}{\log_2(t)}\)

Result on OSE

Recall that for \(\rho < \nu\), we define

\(G(\rho, \nu) = \frac{\nu}{\rho} \frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2} \lor \frac{1}{\rho}\)

Theorem

Let \(\eta \in (0,1)\) and assume that there exists \(\rho < \eta\) such that, for all \(\nu \geq \eta\), \(G(\rho, \nu) \leq \tfrac{t}{C\log^C(t/\delta)}\).

Then, with probability at least \(1-\delta\), OSE recommends arms \((\hat r_t)_{t \geq 1}\) that satisfy \(\gamma(\hat r_t) \leq \eta\)

Practical Limitations

  • \(\mathbb P(Z=1) = 1/\log_2(t)\)
  • Even if we know the first arms are bad, we keep pulling them a lot (about \(t/\log_2(t)\) times)
  • Idea: rank the arms iteratively with a pessimistic metric

PROSE

Start from some arbitrary ranking \(\pi\). For each \(t\):

  • Draw an exploration scope \(Z \sim t^U\), where \(U \sim \mathcal U(0,1)\)
  • Pull UCB over the top \(Z\) ranked arms, \(\hat a_t = \mathrm{argmax}_{\pi(a) \leq Z} UCB_{a,t}\)
  • Sort the arms by LCB: \(LCB_{\pi^{-1}(1),t} \geq LCB_{\pi^{-1}(2),t} \geq \dots\)
  • Recommend the arm with the largest LCB, \(\hat r_t = \pi^{-1}(1)\)

Naive implementation: time complexity \(\tilde O(t^2)\)

Deterministic version + cascading trick for the max: time complexity \(\tilde O(t)\)
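Below is a naive sketch of the PROSE loop (randomized scope, full re-sort of the ranking each round), under the same assumptions as the OSE sketch; the optimized deterministic variant mentioned above avoids this per-round re-sorting.

```python
import numpy as np

def prose(quantile, n_rounds, C=2.0, seed=0):
    """Naive PROSE sketch: like OSE, but the scope ranges over a ranking kept sorted by LCB."""
    rng = np.random.default_rng(seed)
    gammas, counts, sums = [], [], []
    order = []                                   # order[k] = arm currently ranked k+1 (i.e. pi^{-1})

    def ensure_arms(z):
        while len(gammas) < z:
            gammas.append(rng.uniform())
            counts.append(0)
            sums.append(0.0)
            order.append(len(gammas) - 1)        # new arms enter at the bottom of the ranking

    def bounds():
        n = np.array(counts, dtype=float)
        mean = np.array(sums) / np.maximum(n, 1.0)
        rad = C / np.sqrt(np.maximum(n, 1.0))
        ucb = np.where(n > 0, mean + rad, np.inf)
        lcb = np.where(n > 0, mean - rad, -np.inf)
        return ucb, lcb

    recommendations = []
    for t in range(1, n_rounds + 1):
        z = max(1, int(t ** rng.uniform()))      # exploration scope Z ~ t^U
        ensure_arms(z)
        ucb, _ = bounds()
        scope = order[:z]                        # the Z best-ranked arms
        a = scope[int(np.argmax(ucb[scope]))]    # pull the UCB maximizer within the scope
        x = quantile(gammas[a]) + rng.standard_normal()
        counts[a] += 1
        sums[a] += x
        _, lcb = bounds()
        order.sort(key=lambda arm: -lcb[arm])    # re-rank all arms by decreasing LCB
        recommendations.append(order[0])         # recommend the top-ranked arm
    return gammas, recommendations
```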

Numerical Study

Experimental Setting

  • \(\lambda_{\eta} = 1-\eta^{\alpha}\) with \(\alpha \in \{0.25, 0.5, 1, 2\}\)
  • \(K = 5000\) arms
  • \(T_{\mathrm{max}}=50000\)
  • Mean of simple regret trajectories over \(N=2000\) trials
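A hedged sketch of how such an instance can be generated (the exact experimental scripts are not shown here): arm means are drawn through \(\lambda_\eta = 1-\eta^\alpha\), and the simple regret of a recommendation is \(\lambda_0 - \lambda_{\gamma(\hat r)}\) with \(\lambda_0 = 1\).

```python
import numpy as np

# Illustrative instance generator for the setting above (not the authors' exact script)
alpha, K, T_max, n_trials = 0.5, 5000, 50_000, 2000
rng = np.random.default_rng(0)

gammas = rng.uniform(size=K)        # continuous ranks gamma(a) ~ U(0,1)
means = 1.0 - gammas ** alpha       # arm means lambda_{gamma(a)} = 1 - gamma(a)^alpha

def simple_regret(recommended_arm):
    """Simple regret lambda_0 - lambda_{gamma(r)}; here lambda_0 = 1."""
    return 1.0 - means[recommended_arm]
```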

Main Result

Influence of Tuning Parameter

Cumulative Regret

Implementation Details

  • PROSE (optimized deterministic version): \(\sim\) 76 ms/trial
  • BSH (home-made implementation): \(\sim\) 89 ms/trial
  • BSH (original version from ):
    • \(\sim 14\,\mathrm{s}\)/trial (implemented in a slower interpreted language, Python or R)
    • ignores the authors’ recommendation to carry confidence bounds over across doubling-trick restarts