*Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1023/A:1022643204877*

- Top-Down Induction of Decision Trees (TDIDT) is a family of algorithms which construct trees starting from the root and ending with leaves
- The
**induction task**: for a universe of objects with specific discrete attributes, develop a classification rule from a training set to classify the object as a positive (P) or negative (N) instance - Attributes are
**inadequate**if the same attributes can result in different classes.

- A brute force search for the best tree is not computationally feasible.
- ID3 is iterative: it first trains on a subset of the training data, then some of the incorrectly classified objects in the validation set are added to the training subset.
- ID3 tends to create relatively simple trees
- To from a tree from a collection
*C*of objects, we choose a test*T*with outcomes ${O_1, O_2, \ldots, O_w}$ which will partition*C*into $\{C_1, C_2, \ldots, C_w\}$.*T*is effectively a choice of which attribute we will use to split. - The criterion for splitting is based on
**information**. - One of the assumptions is that
*C*will classify objects either*P*or*N*in the same proportion as the representation of these classes in the training data - Let
*C*contain*p*objects of class*P*and*n*objects of class*N*. The information to generate a ‘P’ or ‘N’ message:

- If we are splitting on variable
*A*which takes values $\{A_1, A_2, \ldots, A_v\}$, let each $C_i$ have $p_i$ objects of class*P*and $n_i$ objects of class*N*. The**expected information**is the weighted average:

- The
**information gain**by branching on*A*is:

- The computational complexity of this process is $O( \vert C \vert \cdot \vert A \vert)$

- To deal with noise:
- the algorithm must be able to work with inadequate attributes
- there must be some criterion to avoid fitting to noise

- If
*A*is an attribute with random values, branching on*A*will still cause an information gain:- however, setting a minimum gain may also exclude relevant attributes
- better is to perform a $\chi ^2$ test of stochastic independence, such that we won’t split on
*A*unless irrelevance can be rejected. If*A*is irrelevant, then our number of positive predictions should obey:

- To deal with inadequate attributes, if we cannot classify an object with our tree, we can just label it with the more numerous class.
- Information gain favours attributes with many values (in the discrete case).We can solve this by requiring that all tests have only 2 outcomes, by choosing a subset of the values to become one branch and the remainders to form the other branch.