Adaptive Algorithms for Infinitely Many-Armed Bandits

Introduction

This Talk

  • Sequential decision problems
  • Unbounded set of options
  • Limited sampling budget

Classical Bandit Problems

Consider a two-armed bandit problem, \(\mathcal A=\{1, 2\}\)

We observe:

\(X_{1,s} = \mu_{1} + \varepsilon_{1,s} \quad \text{and} \quad X_{2,s} = \mu_{2} + \varepsilon_{2,s}\)

  • \(\mu_1\), \(\mu_2\) belong to \(\mathbb R\)
  • \(\varepsilon_{i,s} \sim \mathcal N(0,1)\) (or 1-subgaussian)

Active Learning Setting

At each timestep \(t\):

  • Based on past observations
  • The learner pulls arm \(\hat a_t \in \{1, 2\}\) and gets reward \(X_{\hat a_t, s}\)
  • \(s = N_{\hat a_t,t}\) is the number of times arm \(\hat a_t\) has been pulled so far

Cumulative Regret

Motivation:

  • Overall cost of learning
  • Every suboptimal pull incurs a loss

Defined as

\(R = \sum_{t=1}^n (\mu_1 - \mu_{\hat a_t})\) (assuming arm 1 is optimal)

The UCB Algorithm

Optimism in the face of uncertainty:

Choose the arm that maximizes the upper confidence bound

\(UCB_{a,t} \asymp \hat \mu_a + \frac{c}{\sqrt{N_{a,t}}}\)
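As a concrete illustration, here is a minimal Python sketch of this rule for finitely many Gaussian arms; the exploration constant `c` and the arm means are illustrative choices, not values from the talk.

```python
import numpy as np

def ucb(mus, n_rounds, c=2.0, rng=None):
    """Minimal UCB sketch: pull the arm with the largest empirical mean + c / sqrt(pulls)."""
    rng = np.random.default_rng(rng)
    k = len(mus)
    counts = np.zeros(k)                     # N_{a,t}: number of pulls of each arm
    sums = np.zeros(k)                       # running sum of rewards per arm
    pulled = np.empty(n_rounds, dtype=int)
    for t in range(n_rounds):
        if t < k:                            # pull each arm once to initialize
            a = t
        else:
            means = sums / counts
            a = int(np.argmax(means + c / np.sqrt(counts)))
        x = mus[a] + rng.standard_normal()   # reward X_{a,s} = mu_a + N(0,1) noise
        counts[a] += 1
        sums[a] += x
        pulled[t] = a
    return pulled

# Two-armed example: cumulative regret = sum of the gaps of the pulled arms
mus = np.array([0.5, 0.0])
pulls = ucb(mus, n_rounds=10_000)
regret = np.sum(mus.max() - mus[pulls])
```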

Illustration of UCB

Upper Bound on Cumulative Regret

Assume \(\mu_1 > \mu_2\) and let \(\Delta = \mu_1 - \mu_2\) be the gap.

Main idea. Once the confidence intervals are disjoint, we detect the best arm:

Simplified condition.

\(\frac{1}{\sqrt{t}} \lesssim \Delta\)

We detect the best arm after time \(t \gtrsim 1/\Delta^2\); until then, each suboptimal pull costs at most \(\Delta\), so

\(R \lesssim \min\left(\frac{\Delta}{\Delta^2}, \Delta t\right) = \min\left(\frac{1}{\Delta}, \Delta t\right) \lesssim \sqrt{t}\)

(worst case when \(\Delta \asymp 1/\sqrt{t}\))

Bandits with Infinitely Many Arms

Setting

Let \(\mathcal A = \{1, 2, \dots\}\).

Observations for arm \(a\) (if ever observed):

\(X_{a,s} = \mu_a + \varepsilon_{a,s}\)

  • \(\mu_a\) are drawn independently from some distribution \(\mathcal D\)
  • \(\varepsilon_{a,s}\) are independent \(\mathcal N(0,1)\)

Alternative formulation

\(X_{a,s} = \lambda_{\gamma(a)} + \varepsilon_{a,s}\)

  • \(\eta \mapsto \lambda_{\eta}\) is a decreasing function
  • \(\gamma(a)\) are iid \(\mathcal U(0,1)\)

\(\rightarrow\) \(\gamma(a)\) represents a continuous rank of arm \(a\)

\(\rightarrow\) “best arm” corresponds to \(\gamma(a)=0\)

\(\rightarrow\) Unbounded distribution of arm means iff \(\lambda_0 = +\infty\)
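For intuition, here is a minimal sketch of this formulation in Python: arms are materialized lazily by drawing their rank \(\gamma(a)\), and the Pareto-type quantile function is just an illustrative (unbounded) choice.

```python
import numpy as np

def make_arm(quantile, rng):
    """Draw the continuous rank gamma(a) ~ U(0,1) and return (rank, mean lambda_{gamma(a)})."""
    gamma = rng.uniform()
    return gamma, quantile(gamma)

def pull(mean, rng):
    """One observation X_{a,s} = lambda_{gamma(a)} + epsilon, with N(0,1) noise."""
    return mean + rng.standard_normal()

# Illustrative unbounded case: lambda_eta = eta^{-alpha}, so lambda_0 = +infinity
alpha = 0.3
quantile = lambda eta: eta ** (-alpha)

rng = np.random.default_rng(0)
arms = [make_arm(quantile, rng) for _ in range(10)]   # only materialize arms we actually touch
rewards = [pull(mean, rng) for _, mean in arms]
```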

When Cumulative Regret Fails

Consider the case where \(\lambda_{\gamma(a)} \in \{0, \epsilon\}\) and

\(\mathbb P(\lambda_{\gamma(a)} = \epsilon) = \eta_0\):

\(\lambda_{\eta}= \begin{cases} \epsilon &\text{if } \eta \leq \eta_0\\ 0 &\text{otherwise } \\ \end{cases}\)

De Heide et al. (2021) established that it is impossible to adapt to \(\eta_0\) and still obtain a “good” cumulative regret

It is, however, possible to adapt to \(\eta_0\) if we consider simple regret

Simple Recommendation Reward

At each timestep \(t\), the learner:

  • Pulls an arm \(\hat a_t \in \mathcal A\)
  • Recommends an arm \(\hat r_t\)

The expected simple reward is \(\lambda_{\gamma(\hat r_t)}\)

\(\epsilon\)-Good Arms and Simple Regrets

Two main criteria in the literature:

  • \(\epsilon\)-good arm: \(\mathbb P(\lambda_{\gamma(\hat r_t)} \geq \lambda_0 - \epsilon) \geq 1-\delta\)
  • Simple regret: \(\lambda_0 - \lambda_{\hat r_t}\)

\(\to\) These are anytime criteria.

\(\to\) In fixed-budget settings \(t \in \{1, \dots, T\}\), simple regret can be defined as \(\lambda_0 - \lambda_{\gamma(\hat r_T)}\)

\(\to\) We need boundedness (\(\lambda_0 < +\infty\)) to define them!

Main Question

Given quantile function \(\lambda\),

What simple reward can we achieve?

Theoretical Results

Main Result

Let \(\mathcal D\) be any distribution (potentially unbounded).

Define for \(0< \rho < \nu < 1\):

\(G(\rho, \nu) = \frac{\nu}{\rho} \frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2} \lor \frac{1}{\rho}\)

  • \(\frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2}\) is the inverse squared gap between ranks \(\rho\) and \(\nu\)
  • It would correspond to the sample complexity if we had only two arms, with means \(\lambda_\rho\) and \(\lambda_{\nu}\)
  • For one arm s.t. \(\gamma(a) \leq \rho\), there are on average \(\nu/\rho\) arms s.t. \(\gamma(a) \leq \nu\)
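A small helper computing \(G(\rho, \nu)\) for a generic quantile function may help fix ideas; the Beta-type \(\lambda_\eta\) below is only an example.

```python
import numpy as np

def G(rho, nu, quantile):
    """G(rho, nu) = (nu/rho) / (lambda_rho - lambda_nu)^2, or 1/rho, whichever is larger."""
    gap = quantile(rho) - quantile(nu)       # lambda is decreasing, so the gap is positive
    return max((nu / rho) / gap**2, 1.0 / rho)

# Illustrative Beta-type quantile function: lambda_eta = 1 - eta^alpha
alpha = 0.5
quantile = lambda eta: 1.0 - eta**alpha

print(G(0.01, 0.1, quantile))   # sample-complexity proxy for separating ranks 0.01 and 0.1
```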

Main Result

\(G(\rho, \nu) = \frac{\nu}{\rho} \frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2} \lor \frac{1}{\rho}\)

We are able to detect quantile \(\eta\) at time \(t\) if

\(\exists \rho < \eta\) s.t. \(\forall \nu \geq \eta\),

\(G(\rho, \nu) \ll t\)

(\(\ll\) hides a polylog factor in \(t\) and error probability \(\delta\))

This result is also valid for unbounded distributions!

Consequences

  • \(\zeta\): noise level
  • \(\mathbf{t}=t \cdot \mathrm{polylog}\)

| Type of the distribution \(\mathcal{D}\) | Quantile function \(\lambda_{\eta}\) | Upper bound on \(\gamma(\hat r_t)\) | Lower bound on \(\lambda_{\gamma(\hat r_t)}\) |
|---|---|---|---|
| Bernoulli | \(u\mathbf{1}\{\eta \leq \eta_0\}\) | \(\eta_0 \mathbf{1}\{\mathbf{t} \geq \frac{\zeta^2}{\eta_0 u^2}\}\) | \(u \mathbf{1}\{\mathbf{t} \geq \frac{\zeta^2}{\eta_0 u^2}\}\) |
| Beta (\(\alpha < 1/2\)) | \(1-\eta^{\alpha}\) | \(\frac{1 \lor \frac{\zeta^2}{\alpha^2}}{\mathbf{t}}\) | \(1-\left(\frac{1 \lor \frac{\zeta^2}{\alpha^2}}{\mathbf{t}}\right)^\alpha\) |
| Beta (\(\alpha \geq 1/2\)) | \(1-\eta^{\alpha}\) | \(\frac{1}{\mathbf{t}} \lor \left(\frac{\zeta^2}{\mathbf{t}}\right)^{\frac{1}{2\alpha}}\) | \(1-\frac{1}{\mathbf{t}^\alpha} \lor \sqrt{\frac{\zeta^2}{\mathbf{t}}}\) |
| Pareto (\(\alpha < 1/2\)) | \(\eta^{-\alpha}\) | \(\frac{1}{\mathbf{t}} \lor \left( \frac{\zeta^2}{\alpha^2\mathbf{t}}\right)^{\frac{1}{1-2\alpha}}\) | \(\mathbf{t}^\alpha \land \left(\frac{\alpha^2\mathbf{t}}{\zeta^2}\right)^{\frac{\alpha}{1-2\alpha}}\) |
| Pareto (\(\alpha \geq 1/2\)) | \(\eta^{-\alpha}\) | \(\frac{1}{\mathbf{t}}\lor \mathbf{1}\{\mathbf{t} \geq \zeta^{1/\alpha}\}\) | \(\mathbf{t}^\alpha \mathbf{1}\{\mathbf{t} \geq \zeta^{1/\alpha}\}\) |

Algorithms

Literature

  • Carpentier & Valko (2015): assume that \(\lambda_{\eta} = 1- \eta^{\alpha}\) and estimate \(\alpha\) to adapt
  • Jamieson et al. (2013): bracketing trick, which explores increasing, disjoint sets of arms
  • Zhao et al. (2023): sequential halving combined with a doubling trick and the bracketing trick

OSE and PROSE

  • OSE (Optimistic Scope Exploration)
  • PROSE (Progressive Ranking for Optimistic Scope Exploration)

OSE

\(LCB_{a,t} = \hat \lambda_{\gamma(a)} - \frac{C}{\sqrt{N_{a,t}}} \quad \text{and} \quad UCB_{a,t} = \hat \lambda_{\gamma(a)} + \frac{C}{\sqrt{N_{a,t}}}\)

At each time \(t\):

  • Draw an exploration scope \(Z \sim t^U\), where \(U \sim \mathcal U(0,1)\)
  • Pull UCB over \(\{1, \dots, Z\}\), \(\hat a_t = \mathrm{argmax}_{a \leq Z} UCB_{a,t}\)
  • Recommend LCB over all arms, \(\hat r_t = \mathrm{argmax}_{a \in \mathcal A} LCB_{a,t}\)

There is also a deterministic version (better in practice and computationally more efficient)
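A minimal sketch of the randomized OSE loop above, assuming Gaussian noise, lazy arm materialization, and an illustrative confidence constant C; it is not the exact pseudocode analyzed in the talk.

```python
import numpy as np

def ose(quantile, n_rounds, C=2.0, seed=0):
    """Randomized OSE sketch: UCB pull inside a random scope, LCB recommendation over all arms."""
    rng = np.random.default_rng(seed)
    gammas, counts, sums = [], [], []            # arms are materialized lazily

    def ensure_arms(z):
        while len(gammas) < z:                   # draw gamma(a) ~ U(0,1) for newly needed arms
            gammas.append(rng.uniform())
            counts.append(0)
            sums.append(0.0)

    recommendations = []
    for t in range(1, n_rounds + 1):
        z = max(1, int(t ** rng.uniform()))      # exploration scope Z ~ t^U, U ~ U(0,1)
        ensure_arms(z)
        n = np.array(counts, dtype=float)
        mean = np.array(sums) / np.maximum(n, 1.0)
        ucb = np.where(n > 0, mean + C / np.sqrt(np.maximum(n, 1.0)), np.inf)
        a = int(np.argmax(ucb[:z]))              # pull the UCB maximizer among the first Z arms
        x = quantile(gammas[a]) + rng.standard_normal()
        counts[a] += 1
        sums[a] += x
        n = np.array(counts, dtype=float)
        mean = np.array(sums) / np.maximum(n, 1.0)
        lcb = np.where(n > 0, mean - C / np.sqrt(np.maximum(n, 1.0)), -np.inf)
        recommendations.append(int(np.argmax(lcb)))  # recommend the LCB maximizer over all arms
    return gammas, recommendations

# Example: Beta-type quantile function lambda_eta = 1 - sqrt(eta)
gammas, recs = ose(lambda eta: 1.0 - np.sqrt(eta), n_rounds=2000)
print(gammas[recs[-1]])   # rank gamma(r_t) of the final recommendation
```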

Why do we take \(Z \sim t^U\)?

  • Assume that for any \(q \in (0,1), \lambda_{q} \asymp \lambda_{q/2}\)
  • If \(Z \asymp t^{k/\log_2(t)}=2^k\),

\(\exists a \leq Z\) s.t. \(\gamma(a) \lesssim 2^{-k}\)

  • This happens with non-negligible probability: since \(U \sim \mathcal U(0,1)\),

\(\mathbb P\left(t^{k/\log_2(t)} \leq Z \leq t^{(k+1)/\log_2(t)}\right) = \mathbb P\left(\tfrac{k}{\log_2(t)} \leq U \leq \tfrac{k+1}{\log_2(t)}\right) = \frac{1}{\log_2(t)}\)

Result on OSE

Recall that for \(\rho < \nu\), we define

\(G(\rho, \nu) = \frac{\nu}{\rho} \frac{1}{(\lambda_{\rho}- \lambda_{\nu})^2} \lor \frac{1}{\rho}\)

Theorem

Let \(\eta \in (0,1)\) and assume that there exists \(\rho < \eta\) such that, for all \(\nu \geq \eta\), \(G(\rho, \nu) \leq \tfrac{t}{C\log^C(t/\delta)}\).

Then, with probability at least \(1-\delta\), OSE recommends arms \((\hat r_t)_{t \geq 1}\) that satisfy \(\gamma(\hat r_t) \leq \eta\)

Practical Limitations

  • \(\mathbb P(Z=1) = 1/\log_2(t)\)
  • Even if we know the first arms are bad, we keep pulling them a lot (about \(t/\log_2(t)\) times)
  • Idea: rank the arms iteratively with a pessimistic metric

PROSE

Start from some arbitrary ranking \(\pi\). For each \(t\):

  • Draw an exploration scope \(Z \sim t^U\), where \(U \sim \mathcal U(0,1)\)
  • Pull UCB over the top \(Z\) ranked arms, \(\hat a_t = \mathrm{argmax}_{\pi(a) \leq Z} UCB_{a,t}\)
  • Sort the arms by LCB: \(LCB_{\pi^{-1}(1),t} \geq LCB_{\pi^{-1}(2),t} \geq \dots\)
  • Recommend the arm with the largest LCB, \(\hat r_t = \pi^{-1}(1)\)

Naive implementation: time complexity \(\tilde O(t^2)\)

Deterministic version + cascading trick for the max: time complexity \(\tilde O(t)\)
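Below is a naive sketch of the PROSE loop (randomized scope, full re-sort of the ranking each round), under the same assumptions as the OSE sketch; the optimized deterministic variant mentioned above avoids this per-round re-sorting.

```python
import numpy as np

def prose(quantile, n_rounds, C=2.0, seed=0):
    """Naive PROSE sketch: like OSE, but the scope ranges over a ranking kept sorted by LCB."""
    rng = np.random.default_rng(seed)
    gammas, counts, sums = [], [], []
    order = []                                   # order[k] = arm currently ranked k+1 (i.e. pi^{-1})

    def ensure_arms(z):
        while len(gammas) < z:
            gammas.append(rng.uniform())
            counts.append(0)
            sums.append(0.0)
            order.append(len(gammas) - 1)        # new arms enter at the bottom of the ranking

    def bounds():
        n = np.array(counts, dtype=float)
        mean = np.array(sums) / np.maximum(n, 1.0)
        rad = C / np.sqrt(np.maximum(n, 1.0))
        ucb = np.where(n > 0, mean + rad, np.inf)
        lcb = np.where(n > 0, mean - rad, -np.inf)
        return ucb, lcb

    recommendations = []
    for t in range(1, n_rounds + 1):
        z = max(1, int(t ** rng.uniform()))      # exploration scope Z ~ t^U
        ensure_arms(z)
        ucb, _ = bounds()
        scope = order[:z]                        # the Z best-ranked arms
        a = scope[int(np.argmax(ucb[scope]))]    # pull the UCB maximizer within the scope
        x = quantile(gammas[a]) + rng.standard_normal()
        counts[a] += 1
        sums[a] += x
        _, lcb = bounds()
        order.sort(key=lambda arm: -lcb[arm])    # re-rank all arms by decreasing LCB
        recommendations.append(order[0])         # recommend the top-ranked arm
    return gammas, recommendations
```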

Numerical Study

Experimental Setting

  • \(\lambda_{\eta} = 1-\eta^{\alpha}\) with \(\alpha \in \{0.25, 0.5, 1, 2\}\)
  • \(K = 5000\) arms
  • \(T_{\mathrm{max}}=50000\)
  • Mean of simple regret trajectories over \(N=2000\) trials
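A hedged sketch of how such an instance can be generated (the exact experimental scripts are not shown here): arm means are drawn through \(\lambda_\eta = 1-\eta^\alpha\), and the simple regret of a recommendation is \(\lambda_0 - \lambda_{\gamma(\hat r)}\) with \(\lambda_0 = 1\).

```python
import numpy as np

# Illustrative instance generator for the setting above (not the authors' exact script)
alpha, K, T_max, n_trials = 0.5, 5000, 50_000, 2000
rng = np.random.default_rng(0)

gammas = rng.uniform(size=K)        # continuous ranks gamma(a) ~ U(0,1)
means = 1.0 - gammas ** alpha       # arm means lambda_{gamma(a)} = 1 - gamma(a)^alpha

def simple_regret(recommended_arm):
    """Simple regret lambda_0 - lambda_{gamma(r)}; here lambda_0 = 1."""
    return 1.0 - means[recommended_arm]
```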

Main Result

Influence of Tuning Parameter

Cumulative Regret

Implementation Details

  • PROSE (optimized deterministic version): \(\sim\) 76 ms/trial
  • BSH (home-made implementation): \(\sim\) 89 ms/trial
  • BSH (original version from ):
    • \(\sim 14\,\mathrm{s}\)/trial (implemented in a slower interpreted language, Python or R)
    • ignores the authors’ recommendation to carry confidence bounds over across doubling-trick restarts