## DART: Dropouts meet Multiple Additive Regression Trees

• In MART, subsequent predictors typically focus on a small part of the learning problem and have limited predictive power:
• the first trees have significant contributions, but later trees are overspecialised, affecting only a small fraction of training instances
• thus the ensemble is very sensitive to the decisions of the first trees.
• DART applies the dropout idea from deep neural networks: each boosting round ignores (‘drops’) a random subset of the existing trees. DART thus sits between MART and Random Forest, and both can be seen as special cases of DART: dropping no trees recovers MART, while dropping all trees recovers Random Forest.

### The DART algorithm

• Let M denote the current model after n iterations such that $M = \sum_{i=1}^n T_i$, with $T_i$ as the tree learnt during the ith iteration.
• DART randomly selects a subset $I \subset \{ 1, \ldots, n\}$ of trees to keep and forms the reduced model $\hat{M} = \sum_{i \in I} T_i$.
• The new tree is then fit to the negative gradient of the loss evaluated at the reduced model’s predictions, as in MART.
• But simply adding the new tree back alongside the reintroduced dropped trees would ‘overshoot’ the target, so the new tree is scaled by $\frac{1}{1 + |D|}$ and each dropped tree is rescaled by $\frac{|D|}{1 + |D|}$, where $D$ is the set of dropped trees.
• Trees are selected to be dropped with a Binomial-plus-one technique: each existing tree is dropped with probability $p$, but if none are dropped then one will be randomly selected.
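The round described above can be sketched in code. This is a minimal illustration for squared-error regression (where the negative gradient is just the residual), not the reference implementation; the function names `dart_fit`/`dart_predict` are made up for this example, and scikit-learn's `DecisionTreeRegressor` is assumed as the base learner.

```python
# Hypothetical sketch of DART boosting for squared-error regression.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def dart_fit(X, y, n_rounds=50, p_drop=0.1, max_depth=3, seed=0):
    rng = np.random.default_rng(seed)
    trees, weights = [], []          # ensemble members and per-tree scale factors
    for _ in range(n_rounds):
        n = len(trees)
        # Binomial-plus-one: drop each tree with probability p_drop,
        # and force one random drop if nothing was dropped.
        dropped = [i for i in range(n) if rng.random() < p_drop]
        if n > 0 and not dropped:
            dropped = [int(rng.integers(n))]
        kept = [i for i in range(n) if i not in dropped]
        # Prediction of the reduced model (dropped trees removed).
        pred = np.zeros(len(y))
        for i in kept:
            pred += weights[i] * trees[i].predict(X)
        # For squared error, the negative gradient is the residual.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        # Normalisation: new tree scaled by 1/(1+|D|),
        # dropped trees rescaled by |D|/(1+|D|).
        k = len(dropped)
        for i in dropped:
            weights[i] *= k / (k + 1)
        trees.append(tree)
        weights.append(1.0 / (k + 1))
    return trees, weights

def dart_predict(X, trees, weights):
    return sum(w * t.predict(X) for w, t in zip(weights, trees))
```

Note how the rescaling keeps the ensemble's overall scale stable: the dropped trees and the new tree together contribute roughly what the dropped trees contributed before.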

### Experiment results

• On ranking tasks, DART beat the previous winners (who used MART) by roughly the same margin as that between the winning entry and 5th place.
• For regression, DART outperforms MART and Random Forest even when restricted to one drop per round.
• For facial recognition, DART had a higher recall but lower accuracy than MART.