Introduction to Statistical Learning – James, Witten, Hastie, Tibshirani

So far I have only looked at Chapter 8, on tree-based methods, because that is what I specifically needed to learn about. Because the boosting material wasn't detailed enough, I turned to Elements of Statistical Learning (the big brother of this textbook).

8. Tree-based methods

8.1.1 Regression trees

The split regions for predictor $j$ and cutpoint $s$:

\[R_1(j, s) = \left\{X \mid X_j < s\right\} \qquad R_2(j,s) = \left\{X \mid X_j \geq s\right\}\]

The RSS that each split is chosen to minimize:

\[RSS = \sum_{i: x_i \in R_1}(y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2}(y_i - \hat{y}_{R_2})^2\]

The cost-complexity criterion used for pruning, where $|T|$ is the number of terminal nodes:

\[\sum_{m=1}^{|T|} \sum_{i: x_i \in R_m}(y_i - \hat{y}_{R_m})^2 + \alpha |T|\]

$R_m$ is the subset of the predictor space corresponding to the mth terminal node, and $\hat{y}_{R_m}$ is the mean of the training responses falling in $R_m$.
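As a worked illustration of the split criterion (my own sketch, not from the book): exhaustively search over predictors $j$ and cutpoints $s$ for the pair minimizing the two-region RSS above.

```python
import numpy as np

def best_split(X, y):
    """Return (j, s, rss): the predictor index and cutpoint that
    minimize the two-region RSS defined above."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue  # a split must leave observations on both sides
            rss = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss
```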

Decision tree algorithm:

  1. Grow a large tree with recursive binary splitting until each terminal node has fewer than $k$ observations, where $k$ is a preset minimum.
  2. Apply cost-complexity pruning to obtain a sequence of best subtrees, as a function of $\alpha$.
  3. Use K-fold CV to choose $\alpha$.
  4. Return the subtree from step 2 that corresponds to the chosen $\alpha$ (see the sketch after this list).
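A concrete sketch of these four steps, assuming scikit-learn's cost-complexity pruning API (`cost_complexity_pruning_path` and the `ccp_alpha` parameter); the dataset here is a synthetic stand-in.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Steps 1-2: grow a large tree (min_samples_leaf plays the role of k);
# the pruning path yields the sequence of subtrees, one per alpha.
path = DecisionTreeRegressor(
    min_samples_leaf=5, random_state=0
).cost_complexity_pruning_path(X, y)

# Step 3: K-fold cross-validation over the candidate alphas.
cv_scores = [
    cross_val_score(
        DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=alpha, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error",
    ).mean()
    for alpha in path.ccp_alphas
]

# Step 4: refit with the best alpha and return that pruned tree.
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned_tree = DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=best_alpha).fit(X, y)
```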

8.1.2 Classification trees

After partitioning, the prediction for a new test example is the mode of the training responses in its terminal node. The classification error rate is typically used only when pruning; for growing the tree, there are two error (impurity) functions in common use:

Gini index

\[G = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}) = 1- \sum_{k=1}^K \hat{p}_{mk}^2\]

where $\hat{p}_{mk}$ is the proportion of training observations in the $m$th region that are from the $k$th class.

Entropy

\[H = -\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}\]
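A quick sanity check of both measures in Python (my own snippet): both are zero for a pure node and maximal when the classes are evenly mixed.

```python
import numpy as np

def gini(p):
    """Gini index: 1 - sum_k p_k^2."""
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: -sum_k p_k log p_k, treating 0*log(0) as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pure = np.array([1.0, 0.0])          # one-class node
mixed = np.array([0.5, 0.5])         # maximally impure two-class node
print(gini(pure), entropy(pure))     # 0.0 0.0
print(gini(mixed), entropy(mixed))   # 0.5 0.693... (= log 2)
```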

There will sometimes be cases where both branches of a final split predict the same class. The split is still made because it increases node purity, so the two leaves predict that class with different confidence.

8.2.1 Bagging

\[\hat{f}_{avg}(x)= \frac 1 B \sum_{b=1}^B \hat{f}^b(x)\]

where each $\hat{f}^b$ is fit on a separate bootstrap sample of the training data.
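A minimal sketch of bagging (my own, using scikit-learn's `DecisionTreeRegressor` as the base learner): fit each $\hat{f}^b$ on a bootstrap resample and average the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, B=100, seed=0):
    """Fit B unpruned trees on bootstrap resamples; average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # sample n rows with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)              # \hat{f}_{avg}(x)
```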

8.2.2 Random Forests

8.2.3 Boosting regression trees

  1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for all $i$ in the training set, where $r_i$ is the residual
  2. For $b = 1,2,\ldots,B$:
    • Fit tree $\hat{f}^b$ with $d$ splits to the training data $(X, r)$, i.e. to the current residuals rather than the responses
    • Update $\hat{f}$: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$
    • Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$
  3. Output $\hat{f}(x) = \sum_{b=1}^B \lambda \hat{f}^b(x)$ (sketched in code below)
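The same loop in Python (my sketch, using `DecisionTreeRegressor` as the base learner; a tree with $d$ splits has $d+1$ leaves, hence `max_leaf_nodes=d + 1`; $\lambda$ is the shrinkage parameter):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, B=1000, d=1, lam=0.01):
    """Boosting for regression trees, following steps 1-3 above."""
    r = y.astype(float).copy()             # step 1: residuals start as y
    trees = []
    for _ in range(B):                     # step 2
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)
        trees.append(tree)
        r -= lam * tree.predict(X)         # shrink and update the residuals
    # step 3: the final model is the shrunken sum of all B trees
    return lambda X_new: lam * sum(t.predict(X_new) for t in trees)
```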