## A Short Introduction to Boosting

Freund, Y., & Schapire, R. E. (1999). A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771–780.

Let a training set be denoted by $\{(x_i, y_i)\}_{i=1}^m$, where each $x_i \in X$ (the domain or instance space) and each $y_i \in Y$ (the label set), here $Y = \{-1, +1\}$.

• AdaBoost calls a weak learner in a series of rounds $t = 1, 2, \ldots, T$ and maintains a weight $D_t(i)$ on the $i$th training example. Initially the weights are uniform: $D_1(i) = 1/m$.
• The weak learner must find a weak hypothesis $h_t: X \rightarrow \{-1, +1\}$, whose error is defined by:
$\epsilon_t = \sum_{i:h_t(x_i) \neq y_i} D_t(i)$
• Each $h_t$ has an associated $\alpha_t$ given by:
$\alpha_t = \frac 1 2 \ln \left( \frac{1-\epsilon_t}{\epsilon_t} \right)$
• $\alpha_t$ is a weight on $h_t$, reflecting the fact that $h_t$ is a good weak learner if $\epsilon_t$ is close to zero or one ($\epsilon_t \approx 1 \implies -h_t$ is a good learner)
• Weights are updated according to whether $h_t(x_i)$ is correct, and the confidence of the prediction (measured by $\alpha_t$):
$D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
• Note that a higher $\alpha_t$ causes bigger updates; $Z_t$ is a normalising factor chosen so that $D_{t+1}$ remains a distribution.
• The final prediction is the sign of the weighted sum of hypotheses:
$H(x) = \text{sign} \left( \sum_{t=1}^T \alpha_t h_t(x)\right)$
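The algorithm above can be sketched in a few dozen lines. This is a minimal illustration (not the paper's code), using threshold decision stumps as the weak learner; the helper names `stump`, `adaboost`, and `predict` are hypothetical, and the toy dataset is chosen so that no single stump is sufficient:

```python
import numpy as np

def stump(X, y, D):
    """Exhaustively find the decision stump with lowest weighted error epsilon_t."""
    best = None
    for j in range(X.shape[1]):                    # feature index
        for thr in np.unique(X[:, j]):             # candidate thresholds
            for sign in (1, -1):                   # stump polarity
                pred = np.where(X[:, j] > thr, sign, -sign)
                eps = D[pred != y].sum()           # weighted error
                if best is None or eps < best[0]:
                    best = (eps, j, thr, sign)
    return best

def adaboost(X, y, T=10):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    hyps = []
    for t in range(T):
        eps, j, thr, sign = stump(X, y, D)         # weak hypothesis h_t
        eps = max(eps, 1e-12)                      # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t
        pred = np.where(X[:, j] > thr, sign, -sign)
        D = D * np.exp(-alpha * y * pred)          # reweight examples
        D /= D.sum()                               # Z_t normalisation
        hyps.append((alpha, j, thr, sign))
    return hyps

def predict(hyps, X):
    """H(x) = sign of the alpha-weighted sum of weak hypotheses."""
    agg = np.zeros(X.shape[0])
    for alpha, j, thr, sign in hyps:
        agg += alpha * np.where(X[:, j] > thr, sign, -sign)
    return np.sign(agg)

# A 1-D "interval" concept: positive only on [3, 4]; one stump cannot fit it.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([-1, -1, 1, 1, -1, -1])
hyps = adaboost(X, y, T=5)
```

After a handful of rounds the combined hypothesis fits the interval exactly, even though every individual stump misclassifies at least two points.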

### Error of AdaBoost

• For each $h_t$, write $\epsilon_t = 1/2 - \gamma_t$, so $\gamma_t$ measures the improvement over random guessing. It can be shown that the training error of $H$ drops exponentially fast if each $h_t$ is better than random guessing. Specifically, the training error is bounded by:
$\prod_t [2\sqrt{\epsilon_t (1- \epsilon_t)}] \leq \exp(-2\sum_t \gamma_t^2)$
• The generalisation error depends on the number of boosting rounds $T$, the sample size $m$, the training error $\hat{P}[H(x) \neq y]$, and the VC-dimension $d$ of the weak-hypothesis space (a measure of the expressive power of the weak learners). With high probability:
$\text{generalisation error} \leq \hat{P}[H(x) \neq y] + \tilde{O}\left( \sqrt{\frac{Td}{m}}\right)$
• Larger margins on the training set lead to better generalisation. Additional boosting rounds can continue to increase the margins even after the training error reaches zero, which explains why the test error can keep dropping with more rounds.
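The training-error bound above is easy to check numerically. A small sketch, assuming some hypothetical per-round edges $\gamma_t$ (the specific values are made up for illustration):

```python
import math

gammas = [0.05, 0.10, 0.15, 0.20]        # hypothetical edges over random guessing
eps = [0.5 - g for g in gammas]          # epsilon_t = 1/2 - gamma_t

# Left-hand side: product of 2*sqrt(eps_t * (1 - eps_t)) over rounds
product_bound = math.prod(2 * math.sqrt(e * (1 - e)) for e in eps)

# Right-hand side: exp(-2 * sum of gamma_t^2)
exp_bound = math.exp(-2 * sum(g * g for g in gammas))

print(product_bound, exp_bound)          # product_bound <= exp_bound < 1
```

Both quantities are below 1 whenever every $\gamma_t > 0$, which is why the training error is driven to zero exponentially fast.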

### Relationship to SVM

• We can approach AdaBoost from the perspective of maximising the minimum margin over the training examples $i$. Written with norms, the objective is:
$\max_{\boldsymbol{\alpha}} \min_i \frac{y_i (\boldsymbol{\alpha} \cdot \mathbf{h}(x_i))}{\Vert \boldsymbol{\alpha} \Vert_1 \Vert \mathbf{h}(x_i) \Vert_\infty}$
• where $\Vert \boldsymbol{\alpha} \Vert_1 = \sum_t \vert \alpha_t \vert$ and $\Vert \mathbf{h}(x) \Vert_\infty = \max_t \vert h_t(x) \vert$
• SVM has the same goal, but with Euclidean ($\ell_2$ norms) instead of $\ell_1$ and $\ell_\infty$.
• These norms result in very different margins, especially when the dimensionality is high.
• Maximising the margin with these norms is a linear programming problem for AdaBoost, whereas for the SVM it is a quadratic programming problem. This has computational implications.
• SVM and AdaBoost have different approaches to searching high-dimensional space:
• SVMs can use kernels, which compute inner products in very high-dimensional (even infinite-dimensional) feature spaces without forming the feature vectors explicitly.
• AdaBoost performs a greedy search, adding one weak hypothesis at a time.
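The difference between the two margin definitions can be made concrete on a single example. A sketch with hypothetical numbers (the weight vector `alpha` and weak-hypothesis outputs `h_x` are made up for illustration):

```python
import numpy as np

alpha = np.array([0.5, 1.5, 1.0])   # hypothetical hypothesis weights alpha_t
h_x = np.array([1, 1, -1])          # weak-hypothesis outputs h_t(x) on one example
y = 1                               # its label

raw = y * alpha.dot(h_x)            # unnormalised margin: y * (alpha . h(x))

# AdaBoost-style margin: normalise by ||alpha||_1 * ||h(x)||_inf
margin_boost = raw / (np.abs(alpha).sum() * np.abs(h_x).max())

# SVM-style margin: normalise by the Euclidean norms ||alpha||_2 * ||h(x)||_2
margin_svm = raw / (np.linalg.norm(alpha) * np.linalg.norm(h_x))

print(margin_boost, margin_svm)     # same raw score, different normalised margins
```

The raw score is identical; only the normalisation differs, and the gap between the two margins widens as the number of weak hypotheses (the dimensionality) grows.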

### Experiments and applications

• AdaBoost is easy to implement, and the only meta-parameter that needs tuning is the number of boosting rounds T.
• It is flexible with any choice of weak learner, though simple weak learners tend to have computational benefits.
• Boosting can help identify outliers: they are simply the examples that accumulate the largest weights $D_t(i)$.
• However, excess noise is very detrimental to AdaBoost, because the algorithm keeps concentrating weight on noisy examples it cannot fit.
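The outlier effect follows directly from the update rule: an example that is misclassified every round has its weight multiplied by $e^{\alpha_t}$ each time, while correctly classified examples shrink. A minimal sketch, assuming a constant hypothetical error rate $\epsilon_t = 0.3$ and a single example (index 0) that the weak learner never gets right:

```python
import math

m, T = 10, 8
alpha = 0.5 * math.log((1 - 0.3) / 0.3)   # constant alpha_t for eps_t = 0.3 (assumed)
D = [1.0 / m] * m                          # D_1(i) = 1/m

for t in range(T):
    # Assume example 0 is a noisy outlier, misclassified in every round;
    # all other examples are classified correctly.
    D = [d * math.exp(alpha if i == 0 else -alpha) for i, d in enumerate(D)]
    Z = sum(D)
    D = [d / Z for d in D]                 # normalise (Z_t)

print(D[0])                                # the outlier's weight dominates
```

After only eight rounds the outlier holds almost all of the probability mass, which is exactly why heavy label noise derails AdaBoost.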