In my previous article, (All you need to know about Decision Tree Part 1), we discussed what a decision tree looks like, its terminology, and walked through an example of a decision tree.
Now that you have a basic idea of decision trees, in this article I will discuss some of the major concepts behind them. I have organized the concepts involved in constructing and implementing a decision tree into three major questions:
- How does a tree split, and how does it know where to split?
- How can over-fitting be avoided in a decision tree?
- How do I implement a decision tree using R?
Let’s begin with the first question: how does a tree split, and how does it know where to split?
There are several splitting criteria used in decision trees to decide how to split a node into two. This choice of split is highly important in a decision tree model because it has a large impact on the tree’s accuracy. A split of a node into two or more sub-nodes should increase the homogeneity of the resulting sub-nodes; in other words, the purity of each node with respect to the target variable should increase. The algorithm tries candidate splits on all available variables and selects the split that most increases the purity of the sub-nodes.
Yes, I understand, you might be wondering what these pure and impure nodes are.
A pure node is one in which all the samples have the same class label, which means no further split is needed.
In an impure node, by contrast, the samples do not all share the same class label, so the node needs to be split further.
To answer our first question, let us consider a dummy bank credit data set. This data set has 7 attributes, of which one is a serial number and another is the target variable. That leaves 5 attributes from which we must construct the tree.
The algorithm used for splitting depends on the type of the target variable. Based on the type of dependent variable, decision trees are classified into two types:
- Regression Tree: When the dependent variable is continuous, such trees are classified as Regression trees. In regression tree, the value obtained in the leaf node is the mean of all the observations falling in that node.
- Classification tree: If the dependent variable is categorical then it is a classification tree. In this case, the class obtained in the leaf node is the mode of all the observations falling in that node.
In this article, let us look at two of the most commonly used measures for splitting a tree.
- Gini Index: the Gini index is a purity measure used in building decision trees. It is based on the idea that if two items are selected from a population at random (with replacement), the probability that they both belong to the same class is 1 when the population is pure.
- It works with categorical target variables.
- It is used for performing binary splits.
- The Gini index is directly proportional to the purity of the sub-nodes, i.e. the higher the Gini index, the greater the purity.
Formula for calculating the Gini index:
Gini = p² + q²
where p and q are the probabilities of the two classes in the node (in general, the sum of squared class probabilities over all classes).
Considering the above data set, we know that the target variable is class, and the remaining variables are age, income, marital status, own house and gender. So the root node has to be one of these 5.
In choosing the root node, which is better: gender or marital status?
Overall defaulters => 5 out of 10 => 5/10 = 1/2
Gini index for male => (3/5)² + (2/5)² = 0.36 + 0.16 = 0.52 (p = 3/5, q = 1 − p = 2/5)
Gini index for female => (2/5)² + (3/5)² = 0.52
Weighted Gini index => (5/10) × 0.52 + (5/10) × 0.52 = 0.52
Gini index for married => (4/7)² + (3/7)² = 0.33 + 0.18 = 0.51
Gini index for unmarried => (1/3)² + (2/3)² = 0.11 + 0.44 = 0.56
Weighted Gini index => (7/10) × 0.51 + (3/10) × 0.56 = 0.52
So here the weighted Gini values for ‘Gender’ (0.520) and ‘Marital status’ (0.524) are nearly equal, with ‘Marital status’ marginally higher.
Similarly, we can calculate the Gini score for the remaining variables, and the node split takes place on the variable with the highest weighted Gini score.
Now, let’s check income.
Income > 36,000: 3 samples (of which 0 are defaulters); income <= 36,000: 7 samples (of which 5 are defaulters).
Gini score for income <= 36,000 => (5/7)² + (2/7)² = 0.51 + 0.08 = 0.59
Gini score for income > 36,000 => (0/3)² + (3/3)² = 0 + 1 = 1
Weighted Gini score => (7/10) × 0.59 + (3/10) × 1 = 0.71
Income therefore has the highest weighted Gini score of the three variables checked, so it would be chosen for the root split.
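As a quick sanity check on the arithmetic above, here is a small sketch of the same calculation. It is written in Python purely for illustration (the R implementation comes later in this series), using the (defaulter, non-defaulter) counts from the dummy data set:

```python
def gini_score(counts):
    """Sum of squared class proportions; higher means a purer node."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    """Average the children's Gini scores, weighted by node size."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini_score(g) for g in groups)

# (defaulters, non-defaulters) per child node, from the dummy data set
gender  = weighted_gini([(3, 2), (2, 3)])  # male, female
marital = weighted_gini([(4, 3), (1, 2)])  # married, unmarried
income  = weighted_gini([(5, 2), (0, 3)])  # <= 36,000 and > 36,000
print(round(gender, 2), round(marital, 2), round(income, 2))  # 0.52 0.52 0.71
```

The income split scores highest because its `> 36,000` node is completely pure (no defaulters), which pushes its Gini to 1.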
- Information gain: according to information theory, a less impure node requires less information to describe it, and a more impure node requires more. Information theory measures this degree of disorganization as Entropy. If a sample is completely pure, the entropy is zero; if it is equally divided between two classes, the entropy is one.
Formula for calculating entropy:
Entropy = −p log₂(p) − q log₂(q)
Here, p and q are the probabilities of success and failure in the node.
Let me explain this using the credit data set.
In choosing the root node, which is better: gender or own house?
Entropy for female => −(3/5) log₂(3/5) − (2/5) log₂(2/5) = 0.44 + 0.53 = 0.97
Entropy for male => −(2/5) log₂(2/5) − (3/5) log₂(3/5) = 0.97
Weighted entropy => (5/10) × 0.97 + (5/10) × 0.97 = 0.97
Entropy for own house => −(4/6) log₂(4/6) − (2/6) log₂(2/6) = 0.39 + 0.53 = 0.92
Entropy for non-own house => −(1/4) log₂(1/4) − (3/4) log₂(3/4) = 0.50 + 0.31 = 0.81
Weighted entropy => (6/10) × 0.92 + (4/10) × 0.81 = 0.88
The lower the entropy, the better the split. Therefore, from the above, ‘Own house’ is the better variable to split on.
Similarly, the entropy for all other variables is calculated, and the variable with the least weighted entropy is chosen first for splitting.
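The same sanity check can be done for entropy. Here is a minimal Python sketch (again just for illustration) using the counts from the worked example:

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a node: 0 for a pure node, 1 for a 50/50 split."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def weighted_entropy(groups):
    """Average the children's entropies, weighted by node size."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

# (defaulters, non-defaulters) per child node, from the dummy data set
gender = weighted_entropy([(3, 2), (2, 3)])  # male, female
house  = weighted_entropy([(4, 2), (1, 3)])  # own house, non-own house
print(round(gender, 2), round(house, 2))  # 0.97 0.88
```

Note that the `if c` guard skips empty classes, since log₂(0) is undefined; by convention a zero-probability class contributes zero entropy.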
Now that we have learnt how a tree splits, our next question should be: how long should the tree be allowed to grow? Should it always be grown to its maximum?
The answer is ‘NO’. While building a decision tree, we give the model a set of training samples. If the constructed model fits the training data perfectly, it is not generic: it is completely biased towards the training set. This problem is called over-fitting, and it brings us to our second question.
By the book, over-fitting is the phenomenon in which the learning system fits the given training data so tightly that it becomes inaccurate at predicting the outcomes of unseen data.
How can we avoid over-fitting the model? There are many approaches to preventing over-fitting. Here, we shall discuss two of them:
- By adding constraints
- By pruning
1. By adding constraints:
A tree’s growth can be controlled by setting constraints while constructing the model. This is done through various parameters; the parameters used for controlling the tree size are called hyper-parameters.
Some examples of constraints are:
- Minimum samples for node split: It is the minimum number of samples (or observations) which are required in a node to be considered for splitting.
- Minimum samples at leaf node: the minimum number of observations allowed in a child node; a split is rejected if it would leave fewer observations than this in one of the children.
- Maximum depth of tree: Maximum allowed depth of trees or the maximum number of levels the tree can grow. It reduces the complexity of trees and leads to lower chance of over-fitting.
- Maximum number of leaf nodes: By setting this hyper parameter, we can control the number of leaf nodes that are created.
- Maximum features to consider for a split: the maximum number of features that may be evaluated when searching for the best split.
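While this series will implement trees in R, the same constraints exist in most libraries. As one illustration (not the article’s own implementation), here is how the five hyper-parameters above map onto scikit-learn’s `DecisionTreeClassifier` in Python, fitted on scikit-learn’s bundled breast-cancer data set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(
    min_samples_split=20,  # minimum samples for a node split
    min_samples_leaf=10,   # minimum samples at a leaf node
    max_depth=4,           # maximum depth of the tree
    max_leaf_nodes=15,     # maximum number of leaf nodes
    max_features=10,       # maximum features considered per split
    random_state=0,
).fit(X, y)

# The fitted tree respects the limits set above
print(tree.get_depth(), tree.get_n_leaves())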
2. By pruning:
Pruning is the process of finding an optimal tree size. It is a technique used for reducing the tree size, and it helps to avoid over-fitting.
When a tree is left to grow to its limit, it grows taller but loses strength. To avoid this, we cut the tree back to a suitable size to keep it healthy and strong. Similarly, if we let a decision tree grow to its maximum, the model over-fits. Hence, we prune it to make it a generic model that can be used for predicting test data.
Basically, there are two types of Pruning techniques. They are:
- Pre-pruning: this technique stops the tree’s growth early, even before the training set is perfectly classified.
- Post-pruning: in this technique, the tree is left to grow to its maximum and then branches are cut back; in other words, the leaves that increase the error on unseen data are removed.
Of these two techniques, post-pruning is the more widely used and more successful at avoiding over-fitting.
These are the two main ways of preventing over-fitting of the model. The first method (constraints) is generally referred to as a greedy approach, since we stop the growth even before the training data is perfectly classified.
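As a hedged illustration of post-pruning (again using scikit-learn in Python rather than R), cost-complexity pruning is exposed through the `ccp_alpha` parameter: larger values cut more branches from the fully grown tree, trading training accuracy for a simpler, more generic model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully grown tree: fits the training data (nearly) perfectly
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Post-pruned tree: branches whose removal costs less than ccp_alpha are cut
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

# The pruned tree is no larger than the fully grown one
print(full.get_n_leaves(), pruned.get_n_leaves())
```

Choosing `ccp_alpha` is itself a model-selection problem, typically done with a validation set or cross-validation.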
In my next post, I’ll try to construct a decision tree using R.