This blog deals with the Decision Tree, one of the most popular machine learning algorithms. We have already covered other basic machine learning algorithms, such as Linear Regression and Logistic Regression, in separate blogs. Before proceeding with this blog, we highly recommend reading those for a better understanding. Links to the same can be found below.
Linear Regression: https://ainewgeneration.com/linear-regression-in-machine-learning/
Logistic Regression: https://ainewgeneration.com/logistic-regression/
The decision tree works for both regression and classification problems; therefore, the term Classification And Regression Tree (CART) is often used to refer to it. This algorithm is widely used in areas such as medical diagnosis and prediction, for example predicting cancer in patients, and in assessing credit risk for loan applicants.
Table of Contents
- Basic Terms
- The operation of the decision tree
- ASM (Attribute Selection Measure)
- Information Gain and Information Gain Ratio
- Gini Index
Some basic terms involved in Decision Tree are as follows:
Root Node: The root node is where the decision tree starts. It is the topmost node of the tree and represents the entire dataset, which then gets divided into two or more subsets.
Leaf Nodes: Leaf nodes are the final nodes; the tree cannot be split any further once a leaf node is reached. They represent the final output values. Leaf nodes of the decision tree are pure, so no further splitting is required.
Splitting: The process of dividing a decision node into sub-nodes according to given conditions is known as splitting. Whenever we reach an impure node that contains both positive and negative responses, we need to split it further in order to reach pure leaf nodes.
Branch / Sub-tree: A section of the decision tree formed by splitting is called a branch or sub-tree.
Pruning: Pruning is the process of removing unwanted branches from a tree.
Parent node: A node that splits into sub-nodes is called the parent node of those sub-nodes.
Child node / internal node: The nodes below the root node in the decision tree are called child nodes.
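The terms above can be sketched as a tiny Python data structure; the class and field names here are illustrative, not from the post:

```python
# Minimal sketch of a decision-tree node (names are illustrative).
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Node:
    feature: Optional[str] = None       # attribute tested at this node (None for leaves)
    value: Optional[str] = None         # branch value taken from the parent
    label: Optional[str] = None         # predicted class at a leaf, e.g. "Yes"/"No"
    children: List["Node"] = field(default_factory=list)  # sub-trees; empty for leaves

    def is_leaf(self) -> bool:
        # A leaf node has no children and cannot be split further.
        return not self.children

# The root node splits on Outlook; its Overcast child is a pure leaf.
root = Node(feature="Outlook",
            children=[Node(value="Overcast", label="Yes")])
print(root.is_leaf())              # False -- the root is a parent node
print(root.children[0].is_leaf())  # True  -- Overcast is a leaf node
```

Each non-leaf node acts as a parent, and its `children` list holds the sub-trees produced by splitting.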
The figure below shows the standard decision tree for better understanding.
The operation of the Decision Tree
We will use the Weather dataset to study how a decision tree operates. This dataset consists of 4 features, Day, Outlook, Humidity and Wind, and one output variable, Play (yes or no). Here, we analyze the weather and, on that basis, predict whether or not Alice will play tennis. Thus, the Play feature is the desired output variable, which is categorical in nature.
As we can see, there are 14 rows of data, or instances, from which the mapping between X (the independent variables) and Y (the dependent variable) needs to be learnt.
The decision tree takes the training set and splits it into smaller subsets based on features. We keep repeating this process at every node of the tree, with different subsets and attributes, until there is no uncertainty about whether Alice will play or not.
Let us use the Outlook feature to split the dataset. As we can see, there are 3 possible values for the Outlook feature: Sunny, Overcast and Rain. This makes our decision tree look like the following:
After splitting, we can see that we have found one pure node (with only one type of categorical value) in which Alice always plays. Hence it can be said that when the Outlook is Overcast, Alice always plays.
We also have two impure sets, in which it is not certain whether Alice will play or not, as they contain both types of categorical values, yes as well as no. Thus, these nodes need to be split further into sub-trees or branches based on other features, as shown below:
After this, we have two pure nodes and one impure node. By looking at the decision tree, we can say that when the outlook is Sunny and the humidity is high, Alice will not play, and for normal humidity Alice plays.
Now, the Rainy subset needs to be split further, as it is impure. Thus, we will use the Wind feature of our dataset to carry out this splitting, as shown below:
At this stage, we have arrived at leaf nodes on all our branches. This means that we have split our entire training set into smaller subsets based on a small set of rules/conditions/features, and each subset is pure.
Thus, the final decision tree, after replacing the instances with labels, is presented below:
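Since the post's figures are not reproduced here, the finished tree can also be written out as plain conditionals. The 14-row table below is the standard play-tennis version of this dataset and is an assumption on our part:

```python
# (Outlook, Humidity, Wind, Play) -- assumed standard 14-row play-tennis table.
DATA = [
    ("Sunny", "High", "Weak", "No"),        ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),    ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),      ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"),("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),     ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"),   ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),  ("Rain", "Normal", "Strong", "No"),
]

def predict(outlook, humidity, wind):
    if outlook == "Overcast":                       # pure node: Alice always plays
        return "Yes"
    if outlook == "Sunny":                          # impure -> split again on Humidity
        return "No" if humidity == "High" else "Yes"
    return "Yes" if wind == "Weak" else "No"        # Rain -> split on Wind

# The tree reproduces every training label, i.e. all leaf subsets are pure:
accuracy = sum(predict(o, h, w) == y for o, h, w, y in DATA) / len(DATA)
print(accuracy)  # 1.0
```

Reading the conditionals top to bottom gives exactly the rules derived above: Overcast means play, Sunny defers to Humidity, and Rain defers to Wind.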
ASM (Attribute Selection Measure)
ASM or Attribute Selection Measure is the process by which we select the best feature to start building our decision tree.
Thus, with the help of ASM, we can easily select the best features for the respective nodes of our decision tree.
Majorly, there are two techniques for carrying out ASM, which are:
a) Information Gain
b) Gini Index
The amount of uncertainty present in our dataset is defined as its Entropy. Before we split our data into branches of the decision tree, we are quite uncertain about the outcome; this uncertainty is the entropy of our dataset. In simple terms: if after splitting we obtain only positive responses (yes), then the outcome is clear and we are almost certain about the final output. However, if a node after splitting contains a mixture of both positive (yes) and negative (no) responses, then we are not certain about the final output, which means there is some entropy in that node. Thus, for easy and quick splitting of the decision tree, the entropy of our data should be minimum. The formula to calculate entropy is:

Entropy(S) = − Σ p(yi) · log2 p(yi)
Here, yi is the output variable, which has 'k' possible values, and the sum runs over i = 1 to k.
Note that the entropy is calculated for a specific subset, so p(yi) is the probability of yi within that particular subset, not the entire dataset.
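As a quick sketch, entropy can be computed from a list of class labels like this (the function name is ours):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Full weather dataset: 9 "Yes" and 5 "No" -> roughly 0.940 bits of uncertainty.
print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))  # 0.94

# A pure subset (e.g. Outlook = Overcast, all "Yes") has zero entropy:
print(entropy(["Yes"] * 4) == 0)  # True
```

A 50/50 split would give the maximum entropy of 1 bit, while a pure node gives 0, which matches the intuition above.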
Information Gain is one of the methods used in ASM. It measures the reduction in entropy achieved by splitting on a feature A: from the entropy of the parent set S, we subtract the entropy of each subset after the split, weighted by |Sv|/|S|, where Sv is the subset of S for which feature A takes the value v:

Gain(S, A) = Entropy(S) − Σv (|Sv|/|S|) · Entropy(Sv)
The feature which gives the highest value for information gain is used to split the decision tree. Information gain is actually used to calculate how much information a feature provides about the class.
However, information gain comes with some flaws: it is biased towards features with many distinct values, so the feature with the maximum information gain is sometimes not actually suitable for splitting the dataset. Thus, another measure, the Information Gain ratio, is used for splitting.
The IG ratio is calculated by dividing the information gain by the split information of the feature, i.e. the entropy of the feature's own value distribution:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo(S, A) = − Σv (|Sv|/|S|) · log2(|Sv|/|S|)
Just like Information Gain, the Gini impurity or Gini Index is another parameter on the basis of which the splitting of a decision tree is done. It is the measure of impurity used while creating a decision tree in the CART algorithm.
In contrast to Information Gain, the feature with the lowest Gini Index should be preferred for splitting the tree.
Gini Index can be calculated using the following formula:
Gini Index = 1 − Σj pj²,
where pj stands for the probability of class j.
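A minimal sketch of the Gini computation (the function name is ours):

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum over classes of p_j^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# A pure node has Gini index 0:
print(gini(["Yes"] * 4))  # 0.0

# The full weather dataset (9 "Yes", 5 "No") is impure:
print(round(gini(["Yes"] * 9 + ["No"] * 5), 3))  # 0.459
```

As with entropy, the value is 0 for a pure node and largest (0.5 for two classes) when the classes are evenly mixed, which is why lower Gini values are preferred for splitting.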
I hope this blog on the Decision Tree helped you get started with the decision tree algorithm in Machine Learning. In the next blog, we will cover the implementation of a decision tree on a real dataset for better practical insights.