DART: Dropouts meet Multiple Additive Regression Trees
Rashmi, K. V., & Gilad-Bachrach, R. (2015). DART: Dropouts meet Multiple Additive Regression Trees. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP vol. 38. Retrieved from http://arxiv.org/abs/1505.01866
- In MART, trees added at later iterations typically address only a small part of the learning problem and have limited predictive power:
- the first trees contribute substantially to the prediction, while later trees are overspecialised and affect only a small fraction of the training instances
- as a result, the ensemble is overly sensitive to the decisions of the first few trees.
- DART borrows the dropout method from deep neural networks: at each boosting round, a random subset of the existing trees is ignored. DART thus sits between MART and Random Forest – with no trees dropped it reduces to MART, and with all trees dropped it behaves like Random Forest, so both can be seen as special cases of DART.
The DART algorithm
- Let $M$ denote the current model after $n$ iterations, so that $M = \sum_{i=1}^n T_i$, where $T_i$ is the tree learnt during the $i$-th iteration.
- DART randomly selects a subset $I \subset \{ 1, \ldots, n\}$ of trees to keep and forms the reduced model $\hat{M} = \sum_{i \in I} T_i$; the dropped set is $D = \{ 1, \ldots, n\} \setminus I$.
- The new tree is then fit to the negative gradient of the loss evaluated at the predictions of the reduced model $\hat{M}$, just as in MART (which would use the full model $M$).
- But reintroducing the dropped trees at full weight along with this new tree may ‘overshoot’ the target, since the new tree was fitted to make up for the contribution of all $|D|$ dropped trees. The new tree is therefore scaled by $\frac{1}{1 + |D|}$ and the dropped trees by $\frac{|D|}{1 + |D|}$, so that their combined contribution roughly equals the dropped trees' original contribution.
- Trees are selected for dropping with a Binomial-plus-one technique: each existing tree is dropped independently with probability $p$, and if none is dropped, a single tree is chosen uniformly at random. A code sketch of one full round follows this list.
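To make the round concrete, here is a minimal sketch of DART for squared-error regression. It is a toy reconstruction of the steps above, not the authors' code: the class name `DartRegressor`, the `drop_rate` parameter, and the use of scikit-learn's `DecisionTreeRegressor` as the base learner are all assumptions, and the shrinkage that MART normally applies is omitted.

```python
# Toy DART for squared-error regression (illustrative names, not from the paper).
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class DartRegressor:
    def __init__(self, n_rounds=100, drop_rate=0.1, max_depth=3, random_state=0):
        self.n_rounds = n_rounds
        self.drop_rate = drop_rate  # p: per-tree dropout probability
        self.max_depth = max_depth
        self.rng = np.random.default_rng(random_state)
        self.trees = []    # the trees T_i
        self.weights = []  # per-tree scale factors (all 1 in plain MART)

    def fit(self, X, y):
        for _ in range(self.n_rounds):
            # Binomial-plus-one: drop each tree with probability p;
            # if none was dropped, drop one tree chosen uniformly at random.
            n = len(self.trees)
            dropped = [i for i in range(n) if self.rng.random() < self.drop_rate]
            if n > 0 and not dropped:
                dropped = [int(self.rng.integers(n))]
            kept = [i for i in range(n) if i not in dropped]

            # Prediction of the reduced model M_hat = sum_{i in I} w_i T_i(X).
            pred = np.zeros(len(y))
            for i in kept:
                pred += self.weights[i] * self.trees[i].predict(X)

            # For squared loss the negative gradient is the residual y - M_hat(X).
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, y - pred)

            # Normalisation: shrink the dropped trees by |D|/(1+|D|) and give
            # the new tree weight 1/(1+|D|) before reassembling the ensemble.
            k = len(dropped)
            for i in dropped:
                self.weights[i] *= k / (k + 1)
            self.trees.append(tree)
            self.weights.append(1.0 / (k + 1))
        return self

    def predict(self, X):
        pred = np.zeros(X.shape[0])
        for w, tree in zip(self.weights, self.trees):
            pred += w * tree.predict(X)
        return pred
```

Note that the plus-one rule means at least one tree is dropped every round, so a small `drop_rate` approximates the ‘one drop per round’ restriction mentioned in the experiments below.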
Experimental results
- On the ranking task, DART beat the previous winners (who used MART) by about the same margin as separated the winner from 5th place.
- For regression, DART outperformed MART and Random Forest even when restricted to dropping a single tree per round.
- For facial recognition, DART achieved higher recall but lower accuracy than MART.