DART: Dropouts meet Multiple Additive Regression Trees
Rashmi, K. V., & Gilad-Bachrach, R. (2015). DART: Dropouts meet Multiple Additive Regression Trees. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP vol. 38. Retrieved from http://arxiv.org/abs/1505.01866
- In MART, trees added at later iterations typically address only a small part of the learning problem and have limited predictive power:
- the first trees contribute substantially to the prediction, while later trees are overspecialised and affect only a small fraction of the training instances
- as a result, the ensemble is overly sensitive to the decisions of the first few trees.
- DART borrows the dropout method from deep neural networks: at each boosting round, a random subset of the existing trees is ignored. DART thus sits between MART and Random Forest – with no trees dropped it reduces to MART, and with all trees dropped it behaves like Random Forest, so both can be seen as special cases of DART.
The DART algorithm
- Let $M$ denote the current model after $n$ iterations, so that $M = \sum_{i=1}^n T_i$, where $T_i$ is the tree learnt during the $i$-th iteration.
- DART randomly selects a subset $I \subset \{ 1, \ldots, n\}$ of trees to keep and forms the reduced model $\hat{M} = \sum_{i \in I} T_i$; the dropped set is $D = \{ 1, \ldots, n\} \setminus I$.
- The new tree is then fit to the negative gradient of the loss evaluated at the predictions of the reduced model $\hat{M}$, just as in MART (which would use the full model $M$).
- But reintroducing the dropped trees at full weight along with this new tree may ‘overshoot’ the target, since the new tree was fitted to make up for the contribution of all $|D|$ dropped trees. The new tree is therefore scaled by $\frac{1}{1 + |D|}$ and the dropped trees by $\frac{|D|}{1 + |D|}$, so that their combined contribution roughly equals the dropped trees' original contribution.
- Trees are selected for dropping with a Binomial-plus-one technique: each existing tree is dropped independently with probability $p$, and if none is dropped, a single tree is chosen uniformly at random. A code sketch of one full round follows this list.
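To make the round concrete, here is a minimal sketch of DART for squared-error regression. It is a toy reconstruction of the steps above, not the authors' code: the class name `DartRegressor`, the `drop_rate` parameter, and the use of scikit-learn's `DecisionTreeRegressor` as the base learner are all assumptions, and the shrinkage that MART normally applies is omitted.

```python
# Toy DART for squared-error regression (illustrative names, not from the paper).
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class DartRegressor:
    def __init__(self, n_rounds=100, drop_rate=0.1, max_depth=3, random_state=0):
        self.n_rounds = n_rounds
        self.drop_rate = drop_rate  # p: per-tree dropout probability
        self.max_depth = max_depth
        self.rng = np.random.default_rng(random_state)
        self.trees = []    # the trees T_i
        self.weights = []  # per-tree scale factors (all 1 in plain MART)

    def fit(self, X, y):
        for _ in range(self.n_rounds):
            # Binomial-plus-one: drop each tree with probability p;
            # if none was dropped, drop one tree chosen uniformly at random.
            n = len(self.trees)
            dropped = [i for i in range(n) if self.rng.random() < self.drop_rate]
            if n > 0 and not dropped:
                dropped = [int(self.rng.integers(n))]
            kept = [i for i in range(n) if i not in dropped]

            # Prediction of the reduced model M_hat = sum_{i in I} w_i T_i(X).
            pred = np.zeros(len(y))
            for i in kept:
                pred += self.weights[i] * self.trees[i].predict(X)

            # For squared loss the negative gradient is the residual y - M_hat(X).
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, y - pred)

            # Normalisation: shrink the dropped trees by |D|/(1+|D|) and give
            # the new tree weight 1/(1+|D|) before reassembling the ensemble.
            k = len(dropped)
            for i in dropped:
                self.weights[i] *= k / (k + 1)
            self.trees.append(tree)
            self.weights.append(1.0 / (k + 1))
        return self

    def predict(self, X):
        pred = np.zeros(X.shape[0])
        for w, tree in zip(self.weights, self.trees):
            pred += w * tree.predict(X)
        return pred
```

Note that the plus-one rule means at least one tree is dropped every round, so a small `drop_rate` approximates the ‘one drop per round’ restriction mentioned in the experiments below.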
Experimental results
- On the ranking task, DART beat the previous winners (who used MART) by about the same margin as separated the winner from 5th place.
- For regression, DART outperformed MART and Random Forest even when restricted to dropping a single tree per round.
- For facial recognition, DART achieved higher recall but lower accuracy than MART.