DART: Dropouts meet Multiple Additive Regression Trees
Rashmi, K. V., & Gilad-Bachrach, R. (2015). DART: Dropouts meet Multiple Additive Regression Trees. Proceedings of Machine Learning Research, 38. Retrieved from http://arxiv.org/abs/1505.01866
- In MART, subsequent predictors typically focus on a small part of the learning problem and have limited predictive power:
- the first trees have significant contributions, but later trees are overspecialised, only affecting a small fraction of training instances
- thus the ensemble is very sensitive to the decisions of the first trees.
- DART borrows the dropout method from deep neural networks: each boosting round ignores a random subset of the existing trees when computing the gradient. DART thus sits somewhere between MART and Random Forest – both can be seen as special cases of DART (no dropout and full dropout, respectively).
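For context, this dropout rate is exposed directly in modern boosting libraries. A minimal sketch using XGBoost's built-in `dart` booster (the toy data here is invented purely for illustration):

```python
# Illustrative only: toy data plus XGBoost's built-in DART booster.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# rate_drop=0.0 recovers plain MART (no trees ever dropped), while
# rate_drop near 1.0 drops almost every tree each round, behaving
# much more like a Random Forest.
model = xgb.XGBRegressor(
    booster="dart",
    rate_drop=0.1,   # probability p of dropping each existing tree
    one_drop=1,      # guarantee at least one tree is dropped per round
    n_estimators=100,
)
model.fit(X, y)
print(model.predict(X[:3]))
```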
The DART algorithm
- Let $M$ denote the current model after $n$ iterations, such that $M = \sum_{i=1}^{n} T_i$, where $T_i$ is the tree learnt during the $i$th iteration.
- DART randomly selects a subset $I \subset \{1, \dots, n\}$ of trees to keep and forms the reduced model $\hat{M} = \sum_{i \in I} T_i$.
- The new tree is then fit to the negative gradient of the loss evaluated at $\hat{M}$'s predictions, as in MART.
- But reintroducing the dropped trees alongside the new tree would ‘overshoot’ the target, since both fit (roughly) the same residual. The new tree is therefore scaled by $\frac{1}{1+|D|}$ and each dropped tree by $\frac{|D|}{1+|D|}$, where $D$ is the set of dropped trees, so that their combined effect matches the effect the dropped trees had before the round.
- Trees are selected for dropping with a ‘Binomial-plus-one’ technique: each existing tree is dropped independently with probability $p$, and if none is dropped, one tree is chosen uniformly at random (a sketch of the full round follows).
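To make a round concrete, here is a from-scratch sketch of DART for squared-error regression – my own illustration under the definitions above, not the authors' code – using sklearn regression trees as the weak learners. For squared error, the negative gradient is simply the residual:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def dart_fit(X, y, n_rounds=50, p=0.1, max_depth=3):
    """One possible reading of DART for squared-error regression."""
    trees, weights = [], []  # base learners and their scale factors
    for _ in range(n_rounds):
        # Binomial-plus-one selection: drop each existing tree with
        # probability p; if nothing was dropped, drop one at random.
        dropped = [i for i in range(len(trees)) if rng.random() < p]
        if trees and not dropped:
            dropped = [int(rng.integers(len(trees)))]
        kept = [i for i in range(len(trees)) if i not in dropped]

        # Predictions of the reduced model M_hat = sum_{i in I} T_i.
        pred = np.zeros(len(y))
        for i in kept:
            pred += weights[i] * trees[i].predict(X)

        # For squared error the negative gradient is the residual,
        # computed against the reduced model rather than the full one.
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, y - pred)

        # Normalisation: scale the new tree by 1/(1+|D|) and shrink
        # each reintroduced dropped tree by |D|/(1+|D|).
        k = len(dropped)
        for i in dropped:
            weights[i] *= k / (k + 1)
        trees.append(tree)
        weights.append(1.0 / (k + 1))
    return trees, weights

def dart_predict(trees, weights, X):
    return sum(w * t.predict(X) for w, t in zip(weights, trees))

# Toy usage:
X = rng.normal(size=(300, 5))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1])
trees, weights = dart_fit(X, y)
print(dart_predict(trees, weights, X[:3]))
```

Tracking an explicit per-tree weight makes the normalisation visible: every round in which a tree is dropped permanently shrinks it, which is how DART keeps later trees from carrying outsized influence.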
Experimental results
- On the ranking task, DART beat the previous winning entry (which used MART) by roughly the same margin that separated that winner from fifth place.
- For regression, DART outperforms both MART and Random Forest, even when restricted to dropping a single tree per round.
- For facial recognition, DART had a higher recall but lower accuracy than MART.