Elements of Statistical Learning – Hastie, Tibshirani, Friedman

2. Overview of Supervised Learning

Statistical decision theory

The expected prediction error (EPE) under squared loss, with joint density $g(x,y)$:

\[\begin{align*} EPE(f) &= E(Y-f(X))^2 \\ &= \int \int [y-f(x)]^2g(x,y)\,dx\,dy \end{align*}\]

Conditioning on $X$, for a general loss function $L$:

\[\begin{align*} E(L(X,Y)) &= \int \left[ \int L(x,y)g(y|x)\,dy \right]g(x)\,dx \\ &= E_X \left( \int L(x,y) g(y|x)\,dy \right) \\ &= E_XE_{Y|X}(L(X,Y)|X) \end{align*}\]

It therefore suffices to minimise the inner expectation pointwise at each $x$; for squared loss the minimiser is the conditional mean:

\[\begin{align*} f(x) &= \arg \min_{c} E_{Y|X}([Y-c]^2|X=x) \\ \implies f(x) &= E(Y|X=x) \end{align*}\]

Nearest-neighbour methods approximate the conditional mean directly by averaging responses in a neighbourhood of $x$:

\[\hat{f}(x) = \text{Ave}(y_i|x_i \in N_k(x))\]
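A minimal pure-Python sketch of the nearest-neighbour average $\hat{f}(x) = \text{Ave}(y_i|x_i \in N_k(x))$, for 1-D inputs (the toy data are illustrative):

```python
# k-nearest-neighbour estimate of E(Y|X=x): average the responses of the
# k training points whose inputs are closest to the query point x.

def knn_regress(x, xs, ys, k=3):
    """Return Ave(y_i | x_i in N_k(x)) for 1-D inputs."""
    # Sort training pairs by distance to x and keep the k nearest.
    neighbours = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbours) / k

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.2, 1.9, 3.1, 4.0]
print(knn_regress(2.1, xs, ys, k=3))  # mean of 1.9, 3.1, 1.2 ≈ 2.07
```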

10. Boosting and Additive trees

10.1 Discrete AdaBoost

\[G(x) = \text{sign} \left(\sum_{m=1}^M \alpha_m G_m(x)\right)\]

The AdaBoost algorithm

Set $~w_i = 1/N, ~~i=1,2,\ldots, N$

For $~m=1,\ldots, M$

Fit a classifier $G_m(x)$ to the training data using weights $w_i$, and compute its weighted error:

\[\epsilon_m = \sum_{\text{wrong}}w_i = \sum_{i=1}^N w_i I(y_i \neq G_m(x_i))\]

Compute $~\alpha_m = \ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$, then re-weight (and renormalise) the observations, up-weighting those misclassified:

\[w_i \leftarrow w_i \exp[\alpha_m I(y_i \neq G_m(x_i))], \qquad i=1,2,\ldots,N\]

Return

\[G(x) = \text{sign}\left( \sum_{m=1}^M \alpha_mG_m(x)\right)\]
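A runnable pure-Python sketch of the algorithm above, using exhaustive search over 1-D decision stumps as the weak learner (the stump family and toy data are illustrative assumptions; any classifier that accepts observation weights would do):

```python
import math

def stump(threshold, s):
    """Decision stump on a 1-D feature: predicts s left of threshold, -s right."""
    return lambda x: s if x < threshold else -s

def adaboost(xs, ys, M=5):
    """Discrete AdaBoost with 1-D decision stumps as weak learners."""
    N = len(xs)
    w = [1.0 / N] * N                       # uniform initial weights
    learners = []                           # (alpha_m, G_m) pairs
    thresholds = [x + 0.5 for x in xs]      # candidate split points
    for m in range(M):
        # Fit G_m: pick the stump with the lowest weighted error.
        best = None
        for t in thresholds:
            for s in (1, -1):
                G = stump(t, s)
                err = sum(wi for wi, x, y in zip(w, xs, ys) if G(x) != y)
                if best is None or err < best[0]:
                    best = (err, G)
        eps, G = best
        eps = max(eps, 1e-10)               # guard against log(0)
        alpha = math.log((1 - eps) / eps)   # alpha_m from the weighted error
        learners.append((alpha, G))
        # Up-weight the misclassified points, then renormalise.
        w = [wi * math.exp(alpha if G(x) != y else 0.0)
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    def G_final(x):                         # sign of the weighted vote
        return 1 if sum(a * G(x) for a, G in learners) >= 0 else -1
    return G_final

xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]   # not separable by any single stump
G = adaboost(xs, ys, M=5)
print([G(x) for x in xs])   # → [1, 1, -1, -1, 1, 1]
```

No single stump classifies this data, but five boosting rounds drive the training error of the combined vote to zero.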

10.2 Boosting and additive models

\[f(x) = \sum_{m=1}^M \beta_m b(x;\gamma_m)\]

Fitting all $M$ basis functions jointly requires solving

\[\min_{\{\beta_m,\gamma_m\}_1^M} \sum_{i=1}^N L\left(y_i, \sum\nolimits_{m=1}^M \beta_m b(x_i;\gamma_m)\right)\]

which is usually infeasible, whereas the single-basis subproblem

\[\min_{\beta, \gamma} \sum_{i=1}^N L(y_i, \beta b(x_i;\gamma))\]

can often be solved quickly.

10.3 Forward Stagewise Additive Modelling (FSAM)

With squared-error loss, each stage simply fits the residual of the current model:

\[\begin{align*} L(y_i, f_{m-1}(x_i)+\beta b(x_i;\gamma)) &= (y_i-f_{m-1}(x_i)-\beta b(x_i;\gamma))^2 \\ &= (r_{im} - \beta b(x_i;\gamma))^2 \end{align*}\]

FSAM algorithm

Set $~f_0(x) = 0$

For $~m=1,\ldots, M$

\[(\beta_m, \gamma_m) = \arg \min_{\beta, \gamma} \sum_{i=1}^N L(y_i, f_{m-1}(x_i)+\beta b(x_i;\gamma))\] \[f_m(x) = f_{m-1}(x) + \beta_m b(x;\gamma_m)\]

Return $~f_M(x)$
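A minimal pure-Python sketch of FSAM under squared-error loss, using 1-D regression stumps as a hypothetical choice of basis $b(x;\gamma)$ (so $\gamma$ plays the role of the split threshold plus the two leaf constants):

```python
def fit_stump(xs, rs):
    """Least-squares regression stump: best threshold + left/right means."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fsam_squared_loss(xs, ys, M=5):
    """Forward stagewise fitting: each new basis fits the current residuals."""
    f = [0.0] * len(xs)                           # f_0 = 0
    basis = []
    for m in range(M):
        rs = [y - fi for y, fi in zip(ys, f)]     # r_im = y_i - f_{m-1}(x_i)
        b = fit_stump(xs, rs)                     # solve the stagewise subproblem
        basis.append(b)
        f = [fi + b(x) for fi, x in zip(f, xs)]   # update the fit
    return lambda x: sum(b(x) for b in basis)

xs = [0, 1, 2, 3, 4]
ys = [0.0, 0.0, 1.0, 1.0, 3.0]
f = fsam_squared_loss(xs, ys, M=5)
```

Each stage only ever fits the residuals of the model built so far, so the training error is non-increasing in $M$.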

10.4 Exponential Loss and AdaBoost

We can show that AdaBoost is equivalent to FSAM with loss:

\[L(y, f(x)) = \exp(-yf(x))\]

For AdaBoost, the basis functions are classifiers $G_m(x) \in \{ -1, 1 \}$. With an exponential loss function in FSAM, we have:

\[\begin{align*} (\beta_m, G_m) &= \arg \min_{\beta, G} \sum_{i=1}^N \exp [-y_i(f_{m-1}(x_i)+ \beta G(x_i))] \\ &= \arg \min_{\beta, G} \sum_{i=1}^N \exp[-y_i f_{m-1}(x_i)]\exp[-\beta y_i G(x_i)] \end{align*}\]

Because $\exp[-y_i f_{m-1}(x_i)]$ depends on neither $\beta$ nor $G$, it can be regarded as a weight $w_i^{(m)}$ on observation $i$ at that particular iteration.

\[\therefore (\beta_m, G_m) = \arg \min_{\beta, G} \sum_{i=1}^N w_i^{(m)} \exp[-\beta y_i G(x_i)]\]

To find the minimising $\beta$ and G, we manipulate the sum:

\[\sum_{i=1}^N w_i^{(m)} \exp[-\beta y_i G(x_i)] = e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^\beta \sum_{y_i \neq G(x_i)}w_i^{(m)}\] \[=(e^\beta - e^{-\beta})\sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)) + e^{-\beta} \sum_{i=1}^N w_i^{(m)}\]

By observation, the minimising G:

\[G_m = \arg \min_{G} \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i))\]

We can find $\beta$ by differentiating and setting to zero, giving us:

\[\beta_m = \frac 1 2 \ln\left(\frac{1-\epsilon_m}{\epsilon_m} \right)\]
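Writing out that differentiation explicitly (with the weights normalised so that $\sum_i w_i^{(m)} = 1$, giving $\epsilon_m = \sum_i w_i^{(m)} I(y_i \neq G_m(x_i))$):

\[\frac{\partial}{\partial \beta}\left[(e^\beta - e^{-\beta})\epsilon_m + e^{-\beta}\right] = (e^\beta + e^{-\beta})\epsilon_m - e^{-\beta} = 0 \implies e^{2\beta} = \frac{1-\epsilon_m}{\epsilon_m}\]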

where $\epsilon_m$ is the weighted error rate. Substituting the minimisers into the stagewise update $f_m(x) = f_{m-1}(x) + \beta_m G_m(x)$ gives the weights for the next iteration:

\[w_i^{(m+1)} = w_i^{(m)} \exp(-\beta_m y_i G_m(x_i))\]

Using $-y_i G_m(x_i) = 2I(y_i \neq G_m(x_i)) - 1$, this equals $w_i^{(m)} \exp[\alpha_m I(y_i \neq G_m(x_i))]\, e^{-\beta_m}$ with $\alpha_m = 2\beta_m$; the factor $e^{-\beta_m}$ multiplies every weight equally and so has no effect. This is exactly the AdaBoost weight update.

10.5 Why Exponential Loss?

Exponential loss is justified by its population minimiser, which is one-half the log-odds:

\[E(\exp[{-yF(x)}]|x) = P(y=1|x)e^{-F(x)} + P(y=-1|x)e^{F(x)}\]

\[\frac{\partial E(\exp[{-yF(x)}]|x)}{\partial F(x)} = 0 \implies F^*(x)=\frac 1 2 \ln \frac{P(y=1|x)}{P(y=-1|x)}\]
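Filling in the intermediate step of that minimisation: setting the derivative to zero,

\[-P(y=1|x)e^{-F(x)} + P(y=-1|x)e^{F(x)} = 0 \implies e^{2F(x)} = \frac{P(y=1|x)}{P(y=-1|x)}\]

and taking logarithms gives the half log-odds.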

10.6 Loss functions and robustness

The binomial deviance (negative binomial log-likelihood) has the same population minimiser as exponential loss, but grows only linearly in the margin $Yf(x)$ for badly misclassified points, making it more robust to outliers:

\[L(Y, f(x)) = \ln(1+e^{-2Yf(x)})\]
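To see the robustness difference numerically, compare the two losses at increasingly negative margins $yf(x)$ (pure-Python sketch):

```python
import math

def exp_loss(margin):
    """Exponential loss exp(-y f(x)) as a function of the margin y f(x)."""
    return math.exp(-margin)

def deviance(margin):
    """Binomial deviance ln(1 + exp(-2 y f(x)))."""
    return math.log(1 + math.exp(-2 * margin))

for m in (-1, -3, -5):
    print(m, round(exp_loss(m), 1), round(deviance(m), 1))
# -1 → 2.7 vs 2.1,  -3 → 20.1 vs 6.0,  -5 → 148.4 vs 10.0
```

Exponential loss grows exponentially with the negative margin, so a few outliers can dominate the criterion; the deviance grows only linearly (for large negative margins it is approximately $-2Yf(x)$).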

10.9 Boosting trees

A tree partitions the input space into regions $R_j$ with a constant value $\gamma_j$ in each:

\[x\in R_j \implies f(x) =\gamma_j\]

\[T(x;\Theta) = \sum_{j=1}^J \gamma_j I(x \in R_j)\]

with parameters $\Theta = \{R_j, \gamma_j\}_1^J$ found by (approximately) minimising the loss:

\[\hat{\Theta} = \arg \min_{\Theta} \sum_{j=1}^J \sum_{x_i \in R_j} L(y_i, \gamma_j) \approx \arg \min_{\Theta} \sum_{i=1}^N \tilde{L}(y_i, T(x_i; \Theta))\]

The boosted tree model is a sum of trees, fit by FSAM:

\[f_M(x) = \sum_{m=1}^M T(x;\Theta_m)\]

\[\hat{\Theta}_m= \arg \min_{\Theta_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i)+T(x_i;\Theta_m))\]

10.10 Numerical Optimisation via Gradient Boosting

It is possible to derive fast numerical approximations for any differentiable loss function. Generally, we want to minimise the total loss over the training data:

\[L(f) = \sum_{i=1}^N L(y_i, f(x_i))\]

This can be treated as a numerical optimisation problem

\[\hat{\mathbf{f}} = \arg \min_{\mathbf{f}}L(\mathbf{f})\]

with ‘parameters’ $\mathbf{f} \in \mathbb{R}^N$ being the approximating function $f(x_i)$ at each of the N data points, i.e:

\[\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_N))^T\]

Numerical optimisation procedures solve:

\[\mathbf{f}_M = \sum_{m=0}^M \mathbf{h}_m, \qquad \mathbf{h}_m \in \mathbb{R}^N\]

$\mathbf{f}_0 = \mathbf{h}_0$ is the initial guess, then each subsequent $\mathbf{f}_m$ is incremented with a step $\mathbf{h}_m$. Different algorithms choose this step differently.

10.10.1 Steepest Descent

In steepest descent, we choose $\mathbf{h}_m = -\rho_m \mathbf{g}_m$, where $\rho_m$ is the scalar step length and $\mathbf{g}_m \in \mathbb{R}^N$ is the gradient of $L(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$.

\[\mathbf{g}_m = \nabla L(\mathbf{f}_{m-1}) \implies g_{im} = \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i)=f_{m-1}(x_i)}\]

The step length can either be constant, or variable as below:

\[\rho_m = \arg \min_\rho L(\mathbf{f}_{m-1} - \rho \mathbf{g}_m)\]

Then the update is simply:

\[\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m\]
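A toy illustration of this update: steepest descent on $\mathbf{f}$ with squared-error loss and a constant step length (the data and step size are illustrative). Because $\mathbf{f}$ is unconstrained, the procedure simply drives each $f(x_i)$ towards $y_i$:

```python
# Steepest descent on the 'parameters' f = (f(x_1), ..., f(x_N)),
# with squared-error loss L = sum_i (y_i - f_i)^2 and a fixed step rho.
ys = [1.0, 2.0, 3.0]
f = [0.0, 0.0, 0.0]                                # initial guess f_0
rho = 0.25                                         # constant step length
for m in range(50):
    g = [-2 * (y - fi) for y, fi in zip(ys, f)]    # g_im = dL/df_i
    f = [fi - rho * gi for fi, gi in zip(f, g)]    # f_m = f_{m-1} - rho g_m
print(f)  # approaches ys: the unconstrained minimiser fits each point exactly
```

This makes the criticism in the text concrete: minimising over $\mathbf{f}$ directly just interpolates the training responses and says nothing about new inputs.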

Steepest descent is maximally greedy: each step takes the direction of fastest local decrease with no regard for later iterations or global optima. More importantly, the gradient is defined only at the $N$ training points, so the fitted $\mathbf{f}$ says nothing about new data. However, it is computationally cheap and works for any differentiable loss function.

10.10.2 Gradient boosting

FSAM is similar to steepest descent; the new tree added at each iteration can be thought of as the gradient update.

The goal of boosted trees is to minimise

\[\sum_{i=1}^N L(y_i, f_{m-1}(x_i) + T(x_i;\Theta_m))\]

But this is difficult for robust loss functions. We can combine the benefits of steepest descent with those of boosted trees by fitting each new tree to the negative gradient values, called the pseudo-residuals:

\[\tilde{\Theta}_m = \arg \min_{\Theta} \sum_{i=1}^N (-g_{im} - T(x_i;\Theta))^2\]

This is a proxy method: fitting trees to the gradient (which points in the direction of decreasing loss) should decrease the loss without explicitly having to minimise the loss function. This is computationally efficient.

Gradient tree boosting algorithm

Set $~f_0(x) = \arg \min_\gamma \sum_{i=1}^N L(y_i, \gamma)$

For $~m=1,\ldots, M$

Compute the pseudo-residuals:

\[r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x_i)=f_{m-1}(x_i)}\]

Fit a regression tree to the targets $r_{im}$, giving terminal regions

\[R_{jm}, \qquad j = 1,2,\ldots,J_m\]

Compute the optimal constant in each region:

\[\gamma_{jm} = \arg \min_\gamma \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)\]

Update $~f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$

Return $~f_M(x)$
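The algorithm can be sketched in pure Python for the absolute-error loss $L(y,f)=|y-f|$, where the pseudo-residuals are just signs and each leaf constant is a median (the use of stumps, the toy data, and the shrinkage factor $\nu$ are illustrative choices):

```python
def median(vals):
    s = sorted(vals)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def sign(v):
    return (v > 0) - (v < 0)

def boost_absolute_loss(xs, ys, M=20, nu=0.5):
    """Gradient tree boosting with stumps for L(y, f) = |y - f|."""
    f = [median(ys)] * len(xs)        # f_0: constant minimising sum_i |y_i - c|
    for m in range(M):
        # Pseudo-residuals: the negative gradient of |y - f| is sign(y - f).
        rs = [sign(y - fi) for y, fi in zip(ys, f)]
        # Fit a least-squares stump (depth-1 tree) to the pseudo-residuals.
        best = None
        for t in sorted(set(xs))[:-1]:
            left = [r for x, r in zip(xs, rs) if x <= t]
            right = [r for x, r in zip(xs, rs) if x > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - ml) ** 2 for r in left)
                   + sum((r - mr) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, t)
        t = best[1]
        # gamma_jm: the region-wise minimiser of absolute loss is the median.
        gl = median([y - fi for x, y, fi in zip(xs, ys, f) if x <= t])
        gr = median([y - fi for x, y, fi in zip(xs, ys, f) if x > t])
        # Shrunken update (nu plays the role of the learning rate in 10.12).
        f = [fi + nu * (gl if x <= t else gr) for fi, x in zip(f, xs)]
    return f                          # fitted values at the training points

xs = [0, 1, 2, 3, 4]
ys = [0.0, 1.0, 2.0, 3.0, 100.0]      # y at x = 4 is an outlier
f = boost_absolute_loss(xs, ys)
```

The trees are fit to the signs only, so a single wild response cannot dominate the tree-growing step; only the region medians see the raw residuals.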

10.11 Right-sized trees for boosting

The target function can be written as an ANOVA expansion; the tree size $J$ controls the highest interaction order the model can capture:

\[\eta(X) = \sum_j \eta_j(X_j) + \sum_{jk} \eta_{jk}(X_j, X_k) + \sum_{jkl} \eta_{jkl} (X_j, X_k, X_l) + \cdots\]

10.12 Regularisation

Shrinkage scales the contribution of each tree by a learning rate $0 < \nu < 1$:

\[f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})\]

10.13 Interpretation

For a single tree $T$, the (squared) importance of variable $X_l$ is the sum, over the $J-1$ internal nodes $t$ at which $X_l$ is the splitting variable, of the squared improvement $\hat{i}_t^2$ in error from that split:

\[I_l^2(T) = \sum_{t=1}^{J-1}\hat{i}_t^2 I(v(t)=l)\]

This is generalised to additive trees by averaging:

\[I_l^2 = \frac 1 M \sum_{m=1}^M I_l^2 (T_m)\]

These importance measures are relative: the convention is to set the largest to 100 and scale the rest accordingly.