A Short Introduction to Boosting
Freund, Y., & Schapire, R. E. (1999). A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771–780.
Let a training set be denoted by $\{(x_i, y_i)\}_{i=1}^m$, where each $x_i$ belongs to some domain or instance space $X$ and each $y_i$ belongs to a label set $Y$; for AdaBoost below, assume $Y = \{-1, +1\}$.
AdaBoost
- AdaBoost calls a weak learner repeatedly in a series of rounds $t = 1, 2, \ldots, T$ and maintains a distribution of weights $D_t(i)$ over the training examples, initialised to $D_1(i) = 1/m$.
- On each round, the weak learner must find a weak hypothesis $h_t: X \rightarrow \{-1, +1\}$, whose error is defined by:
\[\epsilon_t = \sum_{i:h_t(x_i) \neq y_i} D_t(i)\]
- Each $h_t$ has an associated $\alpha_t$ given by:
\[\alpha_t = \frac 1 2 \ln \left( \frac{1-\epsilon_t}{\epsilon_t} \right)\]
- $\alpha_t$ is a weight on $h_t$, reflecting that $h_t$ is informative whenever $\epsilon_t$ is far from $1/2$: if $\epsilon_t$ is close to zero, $h_t$ itself is a good predictor, while if $\epsilon_t$ is close to one, $-h_t$ is (and $\alpha_t$ is then negative).
- Weights are updated according to whether $h_t(x_i)$ is correct, and the confidence of that prediction (measured by $\alpha_t$).
\[D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}\]
- Note that a larger $\alpha_t$ causes a bigger update. $Z_t$ is a normalising factor chosen so that $D_{t+1}$ remains a probability distribution.
- The final prediction is the sign of the weighted sum of hypotheses:
\[H(x) = \text{sign} \left( \sum_{t=1}^T \alpha_t h_t(x)\right)\]
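A minimal NumPy sketch of the algorithm above, assuming decision stumps as the weak learner; all function names (`fit_stump`, `adaboost`, etc.) are illustrative, not taken from the paper.

```python
# Minimal AdaBoost sketch; labels y are assumed to be in {-1, +1}.
import numpy as np

def fit_stump(X, y, D):
    """Find the decision stump minimising the weighted error under distribution D."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):                      # candidate feature
        for thr in np.unique(X[:, j]):               # candidate threshold
            for s in (1, -1):                        # candidate polarity
                pred = s * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()             # weighted error epsilon
                if err < best_err:
                    best_err, best_params = err, (j, thr, s)
    return best_err, best_params

def stump_predict(X, params):
    j, thr, s = params
    return s * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=20):
    m = len(y)
    D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        eps, params = fit_stump(X, y, D)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)         # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)        # alpha_t
        pred = stump_predict(X, params)
        D = D * np.exp(-alpha * y * pred)            # reweight: up if wrong, down if right
        D = D / D.sum()                              # divide by Z_t
        hypotheses.append(params)
        alphas.append(alpha)
    return hypotheses, alphas, D                     # D is the final distribution over examples

def predict(X, hypotheses, alphas):
    score = sum(a * stump_predict(X, h) for a, h in zip(alphas, hypotheses))
    return np.sign(score)                            # H(x) = sign(sum_t alpha_t h_t(x))
```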
Error of AdaBoost
- For $h_t$, write $\epsilon_t = 1/2 - \gamma_t$, so that $\gamma_t$ measures how much better $h_t$ is than random guessing. It can be shown that the training error of $H$ drops exponentially fast if each $h_t$ is even slightly better than random guessing ($\gamma_t > 0$). Specifically, the training error is bounded by (a numerical check is sketched after this list):
\[\prod_t \left[2\sqrt{\epsilon_t (1- \epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\left(-2\sum_t \gamma_t^2\right)\]
- The generalisation error depends on the number of boosting rounds $T$, the sample size $m$, the training error, and the VC-dimension $d$ of the weak-hypothesis space (a measure of the expressive power of the weak learners); with high probability it is bounded by:
\[\text{generalisation error} \leq \hat{P}[H(x) \neq y] + \tilde{O}\left( \sqrt{\frac{Td}{m}}\right)\]
- Larger margins on the training set lead to better generalisation, and AdaBoost tends to keep increasing the margins with additional boosting rounds even after the training error reaches zero, which helps explain why test error can continue to fall.
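A quick numerical check of the training-error bound above, assuming a hypothetical constant edge $\gamma_t = 0.1$ over $T = 50$ rounds:

```python
# Evaluate both sides of the training-error bound for made-up edges gamma_t = 0.1.
import numpy as np

gammas = np.full(50, 0.1)            # T = 50 rounds, each weak learner 10% better than chance
epsilons = 0.5 - gammas              # epsilon_t = 1/2 - gamma_t

product_bound = np.prod(2 * np.sqrt(epsilons * (1 - epsilons)))
exponential_bound = np.exp(-2 * np.sum(gammas ** 2))

print(product_bound, exponential_bound)   # ~0.36 <= ~0.37: the bound decays exponentially in T
```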
Relationship to SVM
- We can view AdaBoost from the perspective of maximising the minimum margin over the training examples $i$. Written with norms, the objective is:
\[\max_{\boldsymbol{\alpha}} \min_i \frac{(\boldsymbol{\alpha} \cdot \mathbf{h}(x_i))y_i}{\Vert \boldsymbol{\alpha} \Vert \Vert \mathbf{h}(x_i) \Vert}\]
- where AdaBoost uses $\Vert \boldsymbol{\alpha} \Vert = \Vert \boldsymbol{\alpha} \Vert_1 = \sum_t \vert \alpha_t \vert$ and $\Vert \mathbf{h}(x) \Vert = \Vert \mathbf{h}(x) \Vert_\infty = \max_t \vert h_t(x) \vert$
- SVMs pursue the same goal, but with Euclidean ($\ell_2$) norms in both places instead of $\ell_1$ and $\ell_\infty$.
- These norms can produce very different margins, especially when the dimensionality is high (a toy comparison is sketched after this list).
- With these norms, maximising the minimum margin is a linear programming problem in the AdaBoost setting, whereas for SVMs it is a quadratic programming problem. This has computational implications.
- SVM and AdaBoost have different approaches to searching high-dimensional space:
- SVMs can use kernels, which allow inner products in very high-dimensional feature spaces to be computed efficiently without ever working in those spaces explicitly.
- AdaBoost relies on the weak learner's greedy search, adding one weak hypothesis (one coordinate of the space) at a time.
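As a toy illustration of how the two normalisations can disagree (made-up numbers, not from the paper), take a weight vector $\boldsymbol{\alpha}$ concentrated on a few weak hypotheses:

```python
# Compare the AdaBoost-style (l1 / l_inf) margin with the SVM-style (l2 / l2) margin
# for the same alpha and hypothesis-output vector h(x); purely illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
T = 1000                                   # many weak hypotheses (high-dimensional h(x))
alpha = np.zeros(T)
alpha[:5] = 1.0                            # weight concentrated on 5 hypotheses
h = rng.choice([-1.0, 1.0], size=T)        # h_t(x) in {-1, +1}
h[:5] = 1.0                                # the weighted hypotheses are correct on this example
y = 1.0

score = y * np.dot(alpha, h)                                          # = 5
margin_boost = score / (np.sum(np.abs(alpha)) * np.max(np.abs(h)))    # l1 * l_inf -> 1.0
margin_svm = score / (np.linalg.norm(alpha) * np.linalg.norm(h))      # l2 * l2    -> ~0.07

print(margin_boost, margin_svm)
```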
Experiments and applications
- AdaBoost is easy to implement, and the only meta-parameter that needs tuning is the number of boosting rounds T.
- It can be combined flexibly with any choice of weak learner, though simple weak learners tend to bring computational benefits.
- Boosting can help identify outliers: they tend to be the examples that end up with the highest weights (see the sketch after this list).
- However, excess noise is very detrimental to AdaBoost, because the algorithm keeps trying to fit the noisy examples and concentrates more and more weight on them.
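A small sketch of the outlier idea, reusing the final distribution $D$ returned by the hypothetical `adaboost` function in the earlier sketch:

```python
# Rank training examples by their final AdaBoost weight; the heaviest ones are
# outlier / label-noise candidates worth inspecting by hand.
import numpy as np

def top_weighted_examples(D, k=5):
    """Indices of the k examples with the largest final weight D(i)."""
    return np.argsort(D)[::-1][:k]

# Hypothetical usage with the adaboost sketch above:
# hypotheses, alphas, D = adaboost(X, y, T=50)
# suspects = top_weighted_examples(D, k=10)
```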