
All you need to know about Decision Tree Part 3


This is the third article in the decision tree series. You can access the other two here:

Part 1: All you need to know about Decision Tree Part 1

Part 2: All you need to know about Decision Tree Part 2

In this article, I construct a decision tree using R. For this, I have used the ‘Ionosphere‘ dataset, which is available in the package ‘mlbench’. The dataset contains 35 variables (34 predictors plus the Class label) with 351 observations.

Required libraries for constructing a decision tree

# Library for constructing a decision tree
library(rpart)
# Ionosphere dataset is available in 'mlbench' library
library(mlbench)
# Used for splitting the dataset into train and test data
library(caTools)
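
If any of these packages are missing, they can be installed from CRAN first (a one-time step, shown here for completeness):

# One-time installation from CRAN, if needed
install.packages(c("rpart", "mlbench", "caTools"))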

Dataset Summary

data('Ionosphere')
summary(Ionosphere)

##  V1      V2            V3                V4                 V5         
##  0: 38   0:351   Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000  
##  1:313           1st Qu.: 0.4721   1st Qu.:-0.06474   1st Qu.: 0.4127  
##                  Median : 0.8711   Median : 0.01631   Median : 0.8092  
##                  Mean   : 0.6413   Mean   : 0.04437   Mean   : 0.6011  
##                  3rd Qu.: 1.0000   3rd Qu.: 0.19418   3rd Qu.: 1.0000  
##                  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000  
##        V6                V7                V8                 V9          
##  Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.00000  
##  1st Qu.:-0.0248   1st Qu.: 0.2113   1st Qu.:-0.05484   1st Qu.: 0.08711  
##  Median : 0.0228   Median : 0.7287   Median : 0.01471   Median : 0.68421  
##  Mean   : 0.1159   Mean   : 0.5501   Mean   : 0.11936   Mean   : 0.51185  
##  3rd Qu.: 0.3347   3rd Qu.: 0.9692   3rd Qu.: 0.44567   3rd Qu.: 0.95324  
##  Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.00000  
##       V10                V11                V12          
##  Min.   :-1.00000   Min.   :-1.00000   Min.   :-1.00000  
##  1st Qu.:-0.04807   1st Qu.: 0.02112   1st Qu.:-0.06527  
##  Median : 0.01829   Median : 0.66798   Median : 0.02825  
##  Mean   : 0.18135   Mean   : 0.47618   Mean   : 0.15504  
##  3rd Qu.: 0.53419   3rd Qu.: 0.95790   3rd Qu.: 0.48237  
##  Max.   : 1.00000   Max.   : 1.00000   Max.   : 1.00000  
##       V13               V14                V15               V16          
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.: 0.0000   1st Qu.:-0.07372   1st Qu.: 0.0000   1st Qu.:-0.08170  
##  Median : 0.6441   Median : 0.03027   Median : 0.6019   Median : 0.00000  
##  Mean   : 0.4008   Mean   : 0.09341   Mean   : 0.3442   Mean   : 0.07113  
##  3rd Qu.: 0.9555   3rd Qu.: 0.37486   3rd Qu.: 0.9193   3rd Qu.: 0.30897  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.00000  
##       V17               V18                 V19         
##  Min.   :-1.0000   Min.   :-1.000000   Min.   :-1.0000  
##  1st Qu.: 0.0000   1st Qu.:-0.225690   1st Qu.: 0.0000  
##  Median : 0.5909   Median : 0.000000   Median : 0.5762  
##  Mean   : 0.3819   Mean   :-0.003617   Mean   : 0.3594  
##  3rd Qu.: 0.9357   3rd Qu.: 0.195285   3rd Qu.: 0.8993  
##  Max.   : 1.0000   Max.   : 1.000000   Max.   : 1.0000  
##       V20                V21               V22           
##  Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.000000  
##  1st Qu.:-0.23467   1st Qu.: 0.0000   1st Qu.:-0.243870  
##  Median : 0.00000   Median : 0.4991   Median : 0.000000  
##  Mean   :-0.02402   Mean   : 0.3367   Mean   : 0.008296  
##  3rd Qu.: 0.13437   3rd Qu.: 0.8949   3rd Qu.: 0.188760  
##  Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.000000  
##       V23               V24                V25               V26          
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.: 0.0000   1st Qu.:-0.36689   1st Qu.: 0.0000   1st Qu.:-0.33239  
##  Median : 0.5318   Median : 0.00000   Median : 0.5539   Median :-0.01505  
##  Mean   : 0.3625   Mean   :-0.05741   Mean   : 0.3961   Mean   :-0.07119  
##  3rd Qu.: 0.9112   3rd Qu.: 0.16463   3rd Qu.: 0.9052   3rd Qu.: 0.15676  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.00000  
##       V27               V28                V29               V30          
##  Min.   :-1.0000   Min.   :-1.00000   Min.   :-1.0000   Min.   :-1.00000  
##  1st Qu.: 0.2864   1st Qu.:-0.44316   1st Qu.: 0.0000   1st Qu.:-0.23689  
##  Median : 0.7082   Median :-0.01769   Median : 0.4966   Median : 0.00000  
##  Mean   : 0.5416   Mean   :-0.06954   Mean   : 0.3784   Mean   :-0.02791  
##  3rd Qu.: 0.9999   3rd Qu.: 0.15354   3rd Qu.: 0.8835   3rd Qu.: 0.15407  
##  Max.   : 1.0000   Max.   : 1.00000   Max.   : 1.0000   Max.   : 1.00000  
##       V31               V32                 V33         
##  Min.   :-1.0000   Min.   :-1.000000   Min.   :-1.0000  
##  1st Qu.: 0.0000   1st Qu.:-0.242595   1st Qu.: 0.0000  
##  Median : 0.4428   Median : 0.000000   Median : 0.4096  
##  Mean   : 0.3525   Mean   :-0.003794   Mean   : 0.3494  
##  3rd Qu.: 0.8576   3rd Qu.: 0.200120   3rd Qu.: 0.8138  
##  Max.   : 1.0000   Max.   : 1.000000   Max.   : 1.0000  
##       V34            Class    
##  Min.   :-1.00000   bad :126  
##  1st Qu.:-0.16535   good:225  
##  Median : 0.00000             
##  Mean   : 0.01448             
##  3rd Qu.: 0.17166             
##  Max.   : 1.00000
actdata <- Ionosphere
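
A quick dimension check confirms the description above, 351 observations of 35 variables:

dim(actdata)

## [1] 351  35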

Training and test dataset creation

# Set a seed so the split is reproducible (the seed value is arbitrary),
# then split the data into 75% train data and 25% test data
set.seed(123)
samples <- sample.split(actdata$Class, SplitRatio = 0.75)
# Train data
train_set <- subset(actdata, samples == TRUE)
# Test data
test_set <- subset(actdata, samples == FALSE)
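
sample.split from ‘caTools’ stratifies on the Class labels, so both sets should keep roughly the original bad/good ratio. A quick check:

# Class proportions in each set should be close to the original
# 126/351 (bad) vs 225/351 (good) split
prop.table(table(train_set$Class))
prop.table(table(test_set$Class))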

Let us try to build a decision tree with the above training data.

## rpart is used for constructing the decision tree; method = 'class'
## tells rpart this is a classification problem
modeling <- rpart(Class ~ ., data = train_set, method = 'class')
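
For reference, rpart grows the tree under a set of control parameters. The call below is equivalent to the one above, with rpart's documented defaults spelled out; these are the knobs that interact with pruning later:

# Equivalent call with the default growing parameters made explicit:
# cp is the complexity threshold for attempting a split, minsplit the
# minimum node size, and xval the number of cross-validation folds
# used to compute xerror in the CP table
modeling <- rpart(Class ~ ., data = train_set, method = 'class',
                  control = rpart.control(minsplit = 20, cp = 0.01, xval = 10))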

Plot the tree

plot(modeling)
text(modeling)

(Figure: the decision tree built on the training data)
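
The default plot can be cramped. plot.rpart and text.rpart take a few documented arguments that make the tree easier to read:

# uniform = TRUE spaces the tree levels evenly; use.n = TRUE adds
# the class counts at each node
plot(modeling, uniform = TRUE, margin = 0.1)
text(modeling, use.n = TRUE, cex = 0.8)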

Using this model, let us predict the class of the test data. We drop the ‘Class’ variable (column 35) from the test data before passing it to predict.

model_predict <- predict(modeling, test_set[, -35], type = 'class')
# Let's check the accuracy of the model
score <- mean(model_predict == test_set$Class)
score

## [1] 0.8068182
# For better understanding, I've created a confusion matrix as well
table(predicted = model_predict, actuals = test_set$Class)

##          actuals
## predicted bad good
##      bad   18    3
##      good  14   53
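
As a sanity check, the same accuracy can be recovered from this confusion matrix: correct predictions sit on the diagonal, so (18 + 53) / 88 ≈ 0.8068.

cm <- table(predicted = model_predict, actuals = test_set$Class)
# Accuracy = correctly classified / total = diagonal sum / table total
sum(diag(cm)) / sum(cm)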

That’s a decent score. Let’s try to increase it by applying the pruning technique. How will it affect the model’s score? Let us find out.

For pruning, we need to choose a proper pruning parameter, also known as the cost-complexity parameter (CP), by picking the value that results in the lowest cross-validated prediction error.

# Display the cost-complexity (CP) table from cross-validation
printcp(modeling)

## 
## Classification tree:
## rpart(formula = Class ~ ., data = actdata, method = "class")
## 
## Variables actually used in tree construction:
## [1] V1  V22 V27 V3  V5 
## 
## Root node error: 126/351 = 0.35897
## 
## n= 351 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.547619      0   1.00000 1.00000 0.071327
## 2 0.206349      1   0.45238 0.51587 0.057758
## 3 0.021164      2   0.24603 0.31746 0.047248
## 4 0.010000      5   0.18254 0.39683 0.051969

From this table, we can identify the CP value with the lowest cross-validation error (the xerror column).

# Index of the CP table row with the lowest cross-validation error
cp <- which.min(modeling$cptable[, 'xerror'])
cpt <- modeling$cptable[cp, 'CP']

# Prune the tree with the obtained cost-complexity parameter
pruned_modeling <- prune(modeling, cp = cpt)
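
A common alternative, not used in this article, is the one-standard-error rule: pick the simplest tree whose cross-validation error is within one standard error of the minimum. A minimal sketch against the same cptable:

# 1-SE rule: threshold = min xerror + its standard error; take the
# first (smallest) tree in the table that meets it
threshold <- modeling$cptable[cp, 'xerror'] + modeling$cptable[cp, 'xstd']
cp_1se <- modeling$cptable[which(modeling$cptable[, 'xerror'] <= threshold)[1], 'CP']
pruned_1se <- prune(modeling, cp = cp_1se)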

# Plot the pruned model
plot(pruned_modeling)
text(pruned_modeling)

(Figure: the pruned decision tree)

Now, let us check whether pruning has made an impact on the score or not.

pruned_predict <- predict(pruned_modeling, test_set[, -35], type = 'class')
pruned_score <- mean(pruned_predict == test_set$Class)
pruned_score

## [1] 0.8295455

Yes, the score has increased, from about 0.81 to about 0.83. Pruning does help here.
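
To see where the extra accuracy comes from, you can rebuild the confusion matrix for the pruned model exactly as before (output omitted here):

# Confusion matrix for the pruned model, for comparison with the earlier one
table(predicted = pruned_predict, actuals = test_set$Class)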

If you find the base-graphics tree plots plain, you can dress them up with the ‘rattle’ package. Try it yourself and see how much nicer the trees can look, and explore the package.
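
As a starting point, rattle’s fancyRpartPlot function renders the same tree with colored, annotated nodes. A minimal sketch:

# fancyRpartPlot draws a colored tree with per-node class proportions
library(rattle)
fancyRpartPlot(pruned_modeling)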
