- 2. Overview of Supervised Learning
- 10. Boosting and Additive trees
- 10.1 Discrete AdaBoost
- 10.2 Boosting and additive models
- 10.3 Forward Stagewise Additive Modelling (FSAM)
- 10.4 Exponential Loss and AdaBoost
- 10.5 Why Exponential Loss?
- 10.6 Loss functions and robustness
- 10.9 Boosting trees
- 10.10 Numerical Optimisation via Gradient Boosting
- 10.10.1 Steepest Descent
- 10.10.2 Gradient boosting
- 10.11 Right-sized trees for boosting
- 10.12 Regularisation
- 10.13 Interpretation

- Let $X \in \mathbb{R}^p$ be a random input vector, and $Y \in \mathbb{R}$ be a random output variable. The goal of machine learning is to find $f(X)$ to predict
*Y*. - Training an algorithm requires a
**loss function**$L(Y, f(X))$, such as the squared error loss $(Y-f(X))^2$ - Formally, we will want to minimise the expected prediction error (EPE). In the case of squared loss, and with the joint distribution of
*X*and*Y*being $g(x,y)$:

- We can
*condition*on*X*as follows:

- We can then minimise the EPE pointwise, finding that the solution is the conditional mean:

- Nearest neighbour methods implement this directly. For a test point
*x*, we average the responses of all the points in the neighbourhood of*x*:

- For a large training sample size, there will be points in the neighbourhood closer to
*x*, and as*k*increases the average will be more stable.

- Let $Y= G(X)$ be a classifier such that $G(x) \in \{-1, 1\}$
- A
**weak classifier**is one whose predictions are only slightly better than random guessing. **Boosting**sequentially applies weak classifiers to modified versions of the data, producing: $G_m(x), m=1,2,\ldots,M$. Then:

- Each training example has a weight $w_i$, which is updated by the algorithm (increased if the classifier fails the prediction)

**The AdaBoost algorithm**

Set$~w_i = 1/N, ~~i=1,2,\ldots, N$

For$~m=1,\ldots, M$\[\epsilon_m = \sum_{\text{wrong}}w_i = \sum_{i=1}^N w_i I(y_i \neq G_m(x_i))\]

Fit$~G_m(x)$ to training data using weights $w_i$Computeerror:

Compute$~\alpha_m = \ln (\frac{1-\epsilon_m}{\epsilon_m})$Set$~w_i \leftarrow w_i \cdot \exp(\alpha_m I(y_i \neq G_m(x_i)))$ for $i = 1,2,\ldots,N$\[G(x) = \text{sign}\left( \sum_{m=1}^M \alpha_mG_m(x)\right)\]

Return

- Boosting fits an
**additive expansion**in a set of**basis functions**. In this case, the basis functions are $G_m(x) \in \{ -1, 1 \}$. More generally, additive expansions take the form:

- $b(x;\gamma_m)$ are the basis functions, real functions of a multivariate
*x*with parameters $\gamma$.- in neural networks, $b(x;\gamma_m)$ is the activation function on the weights
- in trees, $\gamma$ parameterises the splits

- These models are fit by minimising some loss function

- It is hard to explicitly fit this, but we can find a good approximation if we are able to fit a single basis function

- FSAM is a technique that approximates the fitting of an additive model by sequentially fitting its basis functions.
- At each iteration
*m*, we solve for $b(x;\gamma_m)$ and $\beta_m$ then add it to the current expansion $f_{m-1}(x)$ to produce $f_m(x)$. - For squared error loss in regression:

- Thus the term that best fits the previous residuals will be added to the model.

**FSAM algorithm**

Set$~f_0(x) = 0$

For$~m=1,\ldots, M$\[(\beta_m, \gamma_m) = \arg \min_{\beta, \gamma} \sum_{i=1}^N L(y_i, f_{m-1}(x_i)+\beta b(x_i;\gamma))\]

Compute:

Set$~f_m(x)=f_{m-1}(x)+\beta_m b(x;\gamma_m)$

Return$~f(x)$

We can show that AdaBoost is equivalent to FSAM with loss:

\[L(y, f(x)) = \exp(-yf(x))\]For AdaBoost, the basis functions are classifiers $G_m(x) \in \{ -1, 1 \}$. With an exponential loss function in FSAM, we have:

\[\begin{align*} (\beta_m, G_m) &= \arg \min_{\beta, G} \sum_{i=1}^N \exp [-y_i(f_{m-1}(x_i)+ \beta G(x_i))] \\ &= \arg \min_{\beta G} \sum_{i=1}^N \exp[-y_i f_{m-1}(x_i)]\exp[-\beta y_i G(x_i)] \end{align*}\]Because $\exp[-y_i f_{m-1}(x_i)]$ is independent of $\beta$ and *G*, it can be regarded as a weight on the observations for that particular iteration $w_i^{(m)}$.

To find the minimising $\beta$ and *G*, we manipulate the sum:

By observation, the minimising *G*:

We can find $\beta$ by differentiating and setting to zero, giving us:

\[\beta_m = \frac 1 2 \ln\left(\frac{1-\epsilon_m}{\epsilon_m} \right)\]where $\epsilon_m$ is the weighted error. All together, this is equivalent to saying that:

\[w_i^{(m+1)} = w_i^{(m)} \exp(-\beta_m y_i G_m(x_i))\]which is the same as in AdaBoost.

- Exponential loss leads to the elegant reweighting scheme of AdaBoost
- It estimates half the log odds:

- This justifies using its sign as the classification rule

- The term $yf(x)$ is called the
**margin**: observations with positive margin are classified correctly. - Therefore loss functions should penalise negative margins
- The exponential loss function concentrates heavily on large negative margins. If there is noise, or a misspecification of class labels, the exponential loss function will give undue credit to the outlier.
- An alternative to exponential loss is the
**binomial deviance**, given by:

- Although the population minimisers of exponential loss and binomial deviance are the same, the binomial deviance has a penalty which is roughly linear in the size of the margin – it is more
**robust**to outliers. - For regression, squared error severely degrades performance when outliers are introduced. Instead,
**Huber loss**can be used: it behaves like squared error loss near zero, but is linear elsewhere. - Although exponential and squared error losses are not robust, they lead to elegant models like AdaBoost.
**Gradient boosting**can produce elegant algorithms for*any*differentiable loss function.

- Decision trees partition the feature space into $R_j, j=1,2,\ldots,J$ with a constant $\gamma_j$ in each region such that:

- $\gamma_j$ is typically just the mean or mode of $y_i$ in $R_j$.
- Our notation for a decision tree paramaterised by $\Theta= \{R_j, \gamma_j\}_1^J$ is then:

- We can estimate the parameters as follows:

- The first term is the empirical risk for a given set of splits, which is very difficult to optimise. We thus approximate with an easier optimisation criterion.
- The
**boosted tree model**is the sum of such trees using FSAM:

- At each step
*m*, FSAM must solve:

- Substituting the exponential loss would lead to something very similar to AdaBoost.

It is possible to numerically find fast approximations for any differentiable loss function. Generally, we want to minimise a loss function:

\[L(f(x)) = \sum_{i=1}^N L(y_i, f(x_i))\]This can be treated as a numerical optimisation problem

\[\hat{\mathbf{f}} = \arg \min_{\mathbf{f}}L(\mathbf{f})\]with ‘parameters’ $\mathbf{f} \in \mathbb{R}^N$ being the approximating function $f(x_i)$ at each of the *N* data points, i.e:

Numerical optimisation procedures solve:

\[\mathbf{f}_m = \sum_{m=0}^M \mathbf{h}_m, \qquad \mathbf{h}_m \in \mathbb{R}^N\]$\mathbf{f}_0 = \mathbf{h}_0$ is the initial guess, then each subsequent $\mathbf{f}_m$ is incremented with a **step** $\mathbf{h}_m$. Different algorithms choose this step differently.

In steepest descent, we choose $\mathbf{h}_m = -\rho_m \mathbf{g}_m$, where $\rho_m$ is the scalar **step length** and $\mathbf{g}_m \in \mathbf{R}^N$ is the gradient of $L(\mathbf{f}_{m-1})$.

The step length can either be constant, or variable as below:

\[\rho_m = \arg \min_\rho L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m)\]Then the update is simply:

\[\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m\]Steepest descent is very greedy in that it has no consideration for global optima: the only thing that matters is the steepest descent in this iteration. Additionally, it only looks at the training set. However, it is computationally efficient and can work for any differentiable loss function.

FSAM is similar to steepest descent; the new tree added at each iteration can be thought of as the gradient update.

The goal of boosted trees is to minimise

\[\sum_{i=1}^N L(y_i, f_{m-1}(x) + T(x_i;\Theta_m))\]But this is difficult for robust loss functions. We can attempt to combine the benefits of steepest descent with those of boosted trees by *fitting new trees to the (negative) gradient* (called the **pseudoresidual**)

This is a proxy method: fitting trees to the gradient (which points in the direction of decreasing loss) should decrease the loss without explicitly having to minimise the loss function. This is computationally efficient.

**Gradient tree boosting algorithm**

Set$~f_0(x) = \arg \min_\gamma \sum_{i=1}^N L(y_i, \gamma)$

For$~m=1,\ldots, M$\[r_{im} = -\frac{\partial L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}\]

For$~i=1, \ldots, N$

Compute:\[R_{jm}, \qquad j = 1,2,\ldots,J_m\]

Fittree to pseudoresiduals $r_{im}$ giving:\[\gamma_{jm} = \arg \min_\gamma \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)\]

For$~j=1,\ldots,J_m$:

Compute:

Set$~f_m(x)=f_{m-1}(x)+ \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$

Return$~f_M(x)$

- Large trees degrade the performance of boosting
- A good strategy is to restrict all trees to be the same size $J_m$
- Consider the ANOVA (Analysis of Variance) expansion of the target function $\eta = \arg \min_f E_{XY}L(Y, f(X))$, which represents various interactions within $X^T = (X_1, X_2, \ldots, X_p)$:

- Most problems involve only low order interactions, but large trees favour higher-order terms.
- The tree size $J$ limits the order; $J-1$ is the maximum:
- $J=2$ (decision stump) only captures main effects
- $J=3$ allows two variable interaction

- In general, choose $4 \leq J \leq 8$

- Each iteration usually reduces the training risk (it can be made arbitrarily small). However, there is an optimal $M^*$ for generalisation that can be estimated with CV.
- We can also use
**shrinkage**to scale down the contribution from new trees:

- $\nu$ can be regarded as the learning rate. Empirically, it works best if we set $\nu < 0.1$ and compensate with larger
*M*(chosen with early stopping). **Stochastic gradient boosting**grows the next tree on a subsample of training observations (typically about half). Such subsampling only performs well when combined with shrinkage.

For a single tree, the (square) importance of variable $X_l$ is given by summing over all nodes where $l$ is the splitting variable the decrease in error caused by that split:

\[I_l^2(T) = \sum_{t=1}^{J-1}\hat{i}_t^2 I(v(t)=l)\]This is generalised to additive trees by averaging:

\[I_l^2 = \frac 1 M \sum_{m=1}^M I_l^2 (T_m)\]These *I* measures are relative: convention is to set the maximum importance to 100 then scale the rest accordingly.