As the title suggests, this article covers the essential information about decision trees. However, cramming everything into a single post would be overwhelming, so I've split this article into three parts.
Part 1 (this post): Introduction and definitions
Part 2: Advanced topics related to decision trees (splitting, pruning, etc.)
Part 3: Implementing a Decision Tree in R
Let us start with Part 1. Whenever you hear about decision trees, the technical words that come to mind are branches, nodes, and so on. But rather than jumping straight into the technicalities (let's pretend you don't know these terms either), let me take you through it from scratch.
Before talking about decision trees, you need to know what data mining is, because the decision tree is, after all, one of the many techniques of data mining.
So, what is Data Mining?
Suppose you have a set of data and you want to know what exactly that data represents. To find out, you apply certain techniques to extract information from the raw data. This process is called data mining.
The bookish definition: "Data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis."
Above, I mentioned some techniques used for extracting information from data. These are data-analysis techniques or, simply, machine learning algorithms. Based on the input data they work with, machine learning algorithms are broadly classified into two types:
Supervised Learning algorithms
In supervised learning, a set of data called training data, containing a number of training examples, is given to the machine upfront to learn from. A supervised learning algorithm analyzes the training data and derives a function from what it has learned, which can then be used to predict outputs for new examples.
Crucially, the training data contains the target variable (the quantity we want to predict from the data). In supervised learning, the model is built from the training data together with its target variable.
Supervised learning problems can be further grouped into regression and classification problems.
Classification problems: A classification problem is one where the output variable is categorical, such as 'Yes' or 'No', 'Low' or 'High', etc.
Examples: decision trees, support vector machines, and so on.
Regression problems: A regression problem is one where the output variable is a real value, such as 'weight' or 'price'.
Examples: linear regression, etc.
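To make the distinction concrete, here is a minimal pure-Python sketch (the data, thresholds, and coefficients are invented purely for illustration, standing in for what a trained model would learn): a classifier outputs a category, while a regressor outputs a real number.

```python
# Toy labeled training data: (income, credit band) pairs -- invented for illustration.
training_data = [(20, "Low"), (35, "Low"), (60, "High"), (80, "High")]

def classify(income):
    """Classification: the output is a category such as 'Low' or 'High'."""
    # A hand-picked threshold standing in for a learned decision rule.
    return "High" if income >= 50 else "Low"

def predict_weight(height_cm):
    """Regression: the output is a real-valued number (e.g. weight in kg)."""
    # A hand-fitted linear function standing in for a trained regression model.
    return 0.9 * height_cm - 90

print(classify(70))         # → High   (a category)
print(predict_weight(180))  # → 72.0   (a real value)
```

The point is not the specific rules but the shape of the output: categorical for classification, continuous for regression.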
Unsupervised Learning algorithms
The name itself suggests the definition: there is no supervision involved in these algorithms. In unsupervised learning, there is no training data and no target variable. The objective of an unsupervised learning algorithm is to identify the underlying structure in the input data.
These algorithms are called unsupervised precisely because there is no labeled training data or target variable to teach from.
Unsupervised learning algorithms can in turn be classified into clustering and association. (Let's not go into these topics now.)
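As a sketch of the unsupervised setting (the numbers and the two-group split below are invented for illustration), note that the input carries no labels at all; the algorithm only groups similar values together:

```python
# Unlabeled input data -- no target variable anywhere.
points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]

def cluster_two(points):
    """A tiny nearest-centroid grouping (essentially one k-means step, k=2)."""
    lo, hi = min(points), max(points)  # crude initial centroids
    groups = {lo: [], hi: []}
    for p in points:
        # Assign each point to the nearer of the two centroids.
        centroid = lo if abs(p - lo) <= abs(p - hi) else hi
        groups[centroid].append(p)
    return list(groups.values())

print(cluster_two(points))  # → [[1.0, 1.5, 2.0], [10.0, 11.0, 12.0]]
```

The structure (two groups) is discovered from the data itself, not supplied by a teacher.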
So far we have discussed the family from which the decision tree comes; now let's get into the details, starting with the basic definition.
What is a decision tree?
A decision tree is a supervised learning algorithm widely used for building regression and classification models in data mining. The model is called a decision tree because the process of obtaining the output forms a tree structure, which makes the decision-making process easy to represent visually.
For example, assume we have a dataset related to credit risk, with five variables including the target variable. The dataset has the following columns:
| Customer ID | Savings | Assets | Income | Credit risk |
|---|---|---|---|---|
For this dataset, the decision tree could take the following form:
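One possible tree for this dataset can be sketched directly as nested conditions. The split variables below match the dataset columns, but the thresholds, split order, and predicted labels are invented purely for illustration; a real tree is learned from the training data (more on that in Part 2):

```python
def credit_risk(savings, assets, income):
    """A hand-written decision tree for the credit-risk example.
    Thresholds and split order are hypothetical, for illustration only."""
    if savings == "High":    # root node: first split on Savings
        return "Good"        # leaf node
    if assets == "High":     # decision node: split on Assets
        return "Good"        # leaf node
    if income >= 50000:      # decision node: split on Income
        return "Good"        # leaf node
    return "Bad"             # leaf node

print(credit_risk("Low", "Low", 30000))  # → Bad
print(credit_risk("High", "Low", 30000)) # → Good
```

Each `if` corresponds to a node in the tree, and each `return` is a leaf: classifying a customer means walking from the root down one branch to a leaf.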
Requirements for building a decision tree:
- As discussed in the definition, a decision tree is a supervised learning algorithm, and hence a training dataset is required to train it.
- A supervised learning algorithm also needs a target variable in the training dataset so that each case can be classified.
A few terms you should know before we go any further:
Let me illustrate these with the help of a tree, using the example above.
- The root node is the one from where the tree begins.
- Edges are the lines that connect one node to another.
- Decision nodes are the nodes where the tree is further split into branches.
- Leaf nodes are the end nodes where there is no further split.
So far, we have covered the introductory part of decision trees.
Okay, I can guess the questions going through your mind:
How is a tree constructed?
On what basis is the split made?
Which variable is chosen for the initial split?
All these questions will be answered in the next post, "All you need to know about DECISION TREE (Part 2)".