Stochastic Bandit

The learning objective is to maximize the cumulative reward $S_n = \sum_{t = 1}^n X_t$, but this is not a well-posed optimization problem on its own:

  • the horizon $n$ may not be known in advance;
  • the cumulative reward is a random quantity, so we must decide which utility on the distribution of $S_n$ to optimize (typically its expectation);
  • the distributions that govern the rewards of each arm are unknown.

A bandit instance (environment) is a collection of distributions $v = (P_a : a\in \mathcal A)$, one per arm; when arm $a$ is played, the reward $X$ is drawn from $P_a$.
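As a concrete illustration (not from the notes themselves), here is a minimal Python sketch of a bandit instance with Bernoulli reward distributions; the class name `BernoulliBandit` and its interface are choices made for this example.

```python
import numpy as np

class BernoulliBandit:
    """A bandit instance v = (P_a : a in A), here with Bernoulli arms."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)            # mean of each arm's P_a
        self.rng = rng if rng is not None else np.random.default_rng()

    def pull(self, a):
        """Play arm a and return a reward X sampled from P_a."""
        return float(self.rng.random() < self.means[a])

v = BernoulliBandit([0.5, 0.6])   # a two-armed instance
x = v.pull(1)                     # reward drawn from P_1 = Bernoulli(0.6)
```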

Unstructured Bandits

\[\mathcal{E} = \{ v = (P_a: a\in \mathcal A):P_a \in \mathcal M_a ,\ \forall a \in \mathcal A \}\]

The class is unstructured in the sense that knowing one arm's distribution tells us nothing about the others. A specific learner might only work well in a specific class of environments.

Structured Bandits

e.g. a two-armed Bernoulli family in which the arm means are tied together by a single parameter:

\[\mathcal E = \{(\mathcal B(\theta), \mathcal B(1-\theta)) : \theta \in [0,1]\}\]

Here a reward observed from either arm is informative about the means of both arms; in an unstructured class, by contrast, observations from one arm carry no information about the distributions of the others.
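This can be seen in a small (hypothetical) numerical experiment: both arms share the single parameter $\theta$, so pulling only arm 1 is already enough to estimate the mean of arm 0.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
rewards_arm1 = rng.random(10_000) < (1 - theta)   # pull arm 1 only, mean 1 - theta
theta_hat = 1.0 - rewards_arm1.mean()             # estimate of arm 0's mean, never played
print(theta_hat)                                  # close to 0.3
```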

Regret

Def. The regret is the deficit suffered by the learner relative to the optimal policy.

Let $v = (P_a: a \in \mathcal A)$ be a bandit instance, let

\[\mu_a(v) = \int_{-\infty}^{\infty} x \, dP_a(x)\]

be the mean reward of arm $a$, and let $\mu^*(v) = \max_{a\in\mathcal A} \mu_a(v)$ be the largest mean over all arms. The regret of a policy $\pi$ on $v$ after $n$ rounds is

\[R_n(\pi,v) = n \mu^*(v) - \mathbb E\left[ \sum_{t=1}^n X_t\right]\]
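The expectation above is usually intractable to evaluate in closed form for a given learner, so one sketch (with Bernoulli arms and a policy interface assumed for this example) is to estimate $R_n(\pi, v)$ by Monte Carlo simulation:

```python
import numpy as np

def regret_estimate(means, policy, n, runs=2000, rng=None):
    """Monte Carlo estimate of R_n(pi, v) = n * mu^* - E[sum_t X_t] for Bernoulli arms."""
    rng = rng if rng is not None else np.random.default_rng()
    means = np.asarray(means, dtype=float)
    mu_star = means.max()
    total = 0.0
    for _ in range(runs):
        history = []                                  # (arm, reward) pairs seen so far
        for _ in range(n):
            a = policy(history, len(means), rng)      # the policy maps history to an arm
            x = float(rng.random() < means[a])        # reward X_t ~ P_{A_t}
            history.append((a, x))
            total += x
    return n * mu_star - total / runs

def uniform_policy(history, k, rng):
    """A policy that ignores the data and plays an arm uniformly at random."""
    return int(rng.integers(k))

print(regret_estimate([0.5, 0.6], uniform_policy, n=100))   # roughly 100 * 0.05 = 5
```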

Lemma:

Let $v$ be a stochastic bandit environment. Then, (a) $R_n(\pi, v) \ge 0$ for all policies $\pi$; (b) the policy $\pi$ choosing $A_t \in \operatorname{argmax}_a \mu_a$ for all $t$ satisfies $R_n(\pi, v) = 0$; and (c) if $R_n(\pi, v) = 0$ for some policy $\pi$, then $\mathbb P(\mu_{A_t} = \mu^*) = 1$ for all $t \in [n]$.
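As a sanity check of part (b), reusing the hypothetical `regret_estimate` sketch above: a policy that always plays the arm with the largest mean has estimated regret close to zero (up to Monte Carlo noise).

```python
def oracle_policy(history, k, rng):
    """Always play arm 1, the arm with the largest mean in the instance [0.5, 0.6]."""
    return 1

print(regret_estimate([0.5, 0.6], oracle_policy, n=100))   # approximately 0
```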

  • Sublinear regret: a policy $\pi$ has sublinear regret on $\mathcal E$ if

    \[\forall v\in \mathcal E, \lim_{n\rightarrow \infty} \frac{R_n(\pi,v)}{n} = 0\]

Or, more quantitatively, one asks for bounds of the form

\[R_n(\pi, v) \le Cn^p\]

for a universal constant $C > 0$ and some $p < 1$, or for instance-dependent bounds

\[R_n(\pi,v) \le C(v) f(n)\]

where the constant $C(v)$ may depend on the environment $v$ and $f(n)$ grows sublinearly.
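For example, a bound of the first form with $p < 1$ immediately yields sublinear regret:

\[\frac{R_n(\pi, v)}{n} \le C\, n^{p-1} \longrightarrow 0 \quad \text{as } n \to \infty.\]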

Bayesian Regret

\[\mathrm{BR}_n(\pi, Q) = \int_{\mathcal E} R_n(\pi, v) \, dQ(v),\]

where $Q$ is a prior (probability measure) over the environment class $\mathcal E$.
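A corresponding Monte Carlo sketch, reusing the hypothetical `regret_estimate` helper above and assuming (for this example only) that $Q$ is the uniform prior on $\theta$ for the two-armed class $(\mathcal B(\theta), \mathcal B(1-\theta))$:

```python
import numpy as np

def bayesian_regret(policy, n, prior_samples=200, runs_per_env=50, rng=None):
    """Monte Carlo estimate of BR_n(pi, Q): average R_n(pi, v) over environments v ~ Q."""
    rng = rng if rng is not None else np.random.default_rng()
    total = 0.0
    for _ in range(prior_samples):
        theta = rng.random()                                   # draw v ~ Q via theta ~ U[0, 1]
        total += regret_estimate([theta, 1.0 - theta], policy, n,
                                 runs=runs_per_env, rng=rng)   # R_n(pi, v) for this draw
    return total / prior_samples
```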

Decomposing the Regret

\[T_a(t) = \sum_{s = 1}^t \mathbb I\{ A_s = a\}\]

counts the number of times arm $a$ has been played by the end of round $t$. Note that $T_a(n)$ is itself a random quantity, since each $A_t$ depends on the rewards observed in earlier rounds, which are random.
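A tiny (hypothetical) example of the pull counts:

```python
import numpy as np

A = np.array([0, 1, 0, 2, 0])   # arms chosen over rounds t = 1, ..., 5
T_0 = int(np.sum(A == 0))       # T_0(5) = 3: arm 0 was played three times
```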

Lemma 4.5 (Regret decomposition lemma)

\[R_n = \sum_{a \in \mathcal A} \Delta_a \mathbb E[T_a(n)],\]

where $\Delta_a(v) = \mu^*(v) - \mu_a(v)$ is the suboptimality gap of arm $a$ and $\mu_a(v)$ denotes the expected reward of arm $a$.

Proof idea: write $R_n = \sum_{t=1}^n \mathbb E[\mu^* - X_t]$ and split each term over the arms. The key step is

\[\mathbb E[(\mu^* - X_t)\, \mathbb I\{ A_t = a\} \mid A_t ] = \mathbb I\{A_t = a \}\, \mathbb E[\mu^* - X_t \mid A_t] = \mathbb I\{A_t = a\}\, \Delta_a,\]

and taking expectations and summing over $t$ and $a$ gives the lemma.
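The decomposition can also be checked numerically. The following sketch (my own, with a uniform-random policy and Bernoulli arms) compares $R_n$ computed from the definition with $\sum_a \Delta_a\, \mathbb E[T_a(n)]$:

```python
import numpy as np

def decomposition_check(means, n=200, runs=2000, seed=0):
    """Compare R_n from the definition with sum_a Delta_a * E[T_a(n)]."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = len(means)
    mu_star = means.max()
    deltas = mu_star - means                       # suboptimality gaps Delta_a
    pull_counts = np.zeros(k)                      # accumulates T_a(n) over runs
    total_reward = 0.0
    for _ in range(runs):
        for _ in range(n):
            a = int(rng.integers(k))               # the policy: uniform at random
            total_reward += float(rng.random() < means[a])
            pull_counts[a] += 1
    direct = n * mu_star - total_reward / runs          # R_n from the definition
    decomposed = float(deltas @ (pull_counts / runs))   # sum_a Delta_a * E[T_a(n)]
    return direct, decomposed

print(decomposition_check([0.3, 0.5, 0.7]))   # the two numbers should nearly agree (about 40)
```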