In this blog we are going to cover Random Forest, a popular machine learning algorithm. It is a supervised learning algorithm: a supervised algorithm learns from a labelled training dataset to build a model, which is then used to predict results for the test dataset.
In earlier blogs we covered several other machine learning algorithms, such as Linear Regression, Logistic Regression, and Decision Tree. To understand Random Forest properly, the Decision Tree is a prerequisite, so we strongly recommend going through that post first. The link to it can be found right below.
Decision tree blog: https://ainewgeneration.com/decision-tree-in-machine-learning/
Table of Contents:
1) What is Random Forest?
2) Why use Random Forest?
3) Ensemble Learning and its Types
4) Bagging in Ensemble Learning
5) How does Random Forest work?
6) Working Procedure of Random Forest
What is Random Forest?
The Random Forest algorithm is a supervised machine learning algorithm. It uses an ensemble learning method, constructing multiple Decision Trees to solve a problem. Random Forests can solve both Classification and Regression problems, and hence, like Decision Trees, they come under CART (Classification and Regression Trees).
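To see the algorithm in action before we unpack it, here is a minimal sketch using scikit-learn; the iris dataset is just a stand-in for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Labelled data: features X and targets y (supervised learning)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A forest of 100 Decision Trees whose votes decide the final class
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```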
Why use Random Forest?
Since Random Forests are built from multiple Decision Trees, their working procedure is almost the same as that of a Decision Tree, so you might be wondering: why use Random Forest at all?
The answer lies in the fact that although a Random Forest uses only Decision Trees, it predicts results from multiple models, and this is found to produce more accurate outcomes. The final output is decided by majority voting, i.e. the outcome predicted by most of the models is taken as the final result.
Ensemble Learning and its Types
Ensemble learning means combining multiple models to generate an outcome. Since Random Forest combines multiple Decision Trees, it uses ensemble learning.
By combining multiple models we mean that we do not depend entirely on any one model for our output. Rather, we build several such models, train them on our training dataset, and then use all of them to predict the output. The final output is the one given by the majority of these models. Thus, it is this independence from any single model that makes ensemble learning more accurate.
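To make majority voting concrete, here is a tiny sketch; the model names and their predictions are invented purely for illustration:

```python
from collections import Counter

# Hypothetical predictions from four independently trained models
predictions = {"model_1": "Yes", "model_2": "No",
               "model_3": "Yes", "model_4": "Yes"}

# Majority voting: the most common prediction wins
final_output, votes = Counter(predictions.values()).most_common(1)[0]
print(final_output, f"({votes} of {len(predictions)} votes)")  # Yes (3 of 4 votes)
```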
The image below gives a better picture of ensemble learning from the dataset point of view.
As you can see in the image, Models 1, 2, 3, and 4 are each fed data drawn from the original training dataset.
These models need not be based on a single algorithm: Model 1 could be Naive Bayes, Model 2 an SVM, Model 3 a Decision Tree, and Model 4 K-Nearest Neighbors.
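Such a heterogeneous ensemble can be sketched with scikit-learn's VotingClassifier; the four member algorithms mirror the example above, and the dataset is again just a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Four different base models combined by majority ("hard") voting
ensemble = VotingClassifier(estimators=[
    ("nb", GaussianNB()),
    ("svm", SVC()),
    ("tree", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
], voting="hard")
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```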
Broadly speaking, ensemble learning is of two types:
1) Bagging or Bootstrap Aggregation
2) Boosting
We will cover Boosting in detail in other blogs. For now, let us stick to Random Forest, which falls under the Bagging type of ensemble learning.
Bagging in Ensemble Learning
Bagging, also known as Bootstrap Aggregation, can be understood by focusing on its two component words, Bootstrap and Aggregation, as together they describe the working procedure of Bagging.
Bootstrap: this refers to the step in which multiple small datasets, known as bootstrap datasets, are sampled from the training dataset.
Aggregation: this refers to the last step, in which we combine the results from all the models by majority voting to predict our final output. This process of combining results from various models is known as aggregation.
Thus, Bagging uses both of these steps, bootstrapping as well as aggregation. The image below illustrates this further.
Coming to the bootstrap dataset: it is made by random selection with replacement. Random selection simply means that rows of the original dataset are drawn at random to form the various small bootstrap datasets that are fed into the different models. With replacement means that data can be repeated: if a particular row is present in bootstrap dataset D1, it can appear in D1 again, or also in bootstrap dataset D2. This is what we mean by row sampling with replacement.
The image below explains this more clearly.
In this image, we have randomly created two datasets, D1 and D2. As you can see, for dataset D1 we randomly select Row 1, then Row 2, then Row 1 again, and then Row 3. Similarly, for dataset D2 we randomly select Row 1, Row 2, Row 4, and Row 2 again. Here Row 1 and Row 2 have been repeated; thus, repetition is allowed.
Moreover, apart from row sampling, Random Forests also use column sampling, i.e. any given column (feature) may or may not be present in a particular bootstrap dataset (bootstrap sample). After creating the datasets through the steps above, we train a different model on each dataset, rather than training them all on the same data. Once all these models are trained, we combine them into a new model: the final ensemble model. This final ensemble model is a strong model, with high accuracy, precision, and predictive power and a low error rate, because it performs majority voting over the individual models.
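Here is a small sketch of how one such bootstrap dataset could be drawn, with row sampling with replacement and column sampling; the toy array and its dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A toy training set: 8 rows and 4 feature columns
X = rng.normal(size=(8, 4))

def make_bootstrap(X, n_features, rng):
    # Row sampling WITH replacement: the same row may be drawn twice
    rows = rng.integers(0, X.shape[0], size=X.shape[0])
    # Column sampling: keep only a random subset of the features
    cols = rng.choice(X.shape[1], size=n_features, replace=False)
    return X[np.ix_(rows, cols)]

D1 = make_bootstrap(X, n_features=2, rng=rng)
D2 = make_bootstrap(X, n_features=2, rng=rng)
print(D1.shape, D2.shape)  # (8, 2) (8, 2)
```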
How does Random Forest work?
Random Forest works in the same way as Bagging, but there are some differences between the two. In Bagging, the models can use different algorithms: for instance, Model 1 can use Naive Bayes, Model 2 a Support Vector Machine, and so on. In Random Forest, however, all the models use only the Decision Tree algorithm, i.e. every base learner is a Decision Tree. Another difference is that in plain Bagging we only perform row sampling with replacement, whereas in Random Forest we also perform column (feature) sampling: any column of the original training dataset may be chosen or left out when creating a bootstrap dataset.
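In scikit-learn terms the contrast can be sketched as follows; both classes are real scikit-learn APIs, while the parameter values here are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Bagging: any base algorithm, each copy trained on a bootstrap dataset
bagging = BaggingClassifier(estimator=GaussianNB(), n_estimators=10,
                            random_state=42).fit(X, y)

# Random Forest: the base learners are always Decision Trees, and each
# split additionally considers only a random subset of the features
forest = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                                random_state=42).fit(X, y)
print(bagging.score(X, y), forest.score(X, y))
```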
Working Procedure of Random Forest
Step 1: The very first step in implementing Random Forest is to create the small bootstrap datasets from the original training dataset. Each bootstrap dataset is randomly selected: we perform feature/column sampling along with row sampling with replacement. It is this random selection that gives Random Forest its name. The image below shows an example that should make the idea clearer.
Step 2: After creating all the bootstrap datasets as in Step 1, the next step is to create a Decision Tree for each of them. Using Dataset 1, we create Decision Tree 1. Suppose Dataset 1 has four features: Age, Gender, BP, and Weight. We randomly select some of these features and build a Decision Tree; it is not at all compulsory to use all the features. In a normal Decision Tree, we considered every feature when choosing the root node. Here, however, we can pick only two features and calculate the information gain and entropy of just those two. Because the features are chosen randomly, this still serves our purpose.
Suppose we choose the features Age and Gender and calculate the information gain and entropy of only those two. We then choose one of them as the root node: say Age has the higher information gain, so we make Age the root node. In a similar way, when choosing each subsequent node, we randomly choose some of the features (not all) and calculate their information gain and entropy. Suppose that by choosing the Age, Gender, and BP features we have created this Decision Tree.
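For reference, here is a small sketch of the entropy and information gain calculation behind that choice; the Yes/No labels and the split are invented for illustration:

```python
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Parent entropy minus the weighted entropy of the child nodes
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Hypothetical Yes/No labels split by a candidate feature such as Age
parent = np.array(["Yes", "Yes", "Yes", "No", "No", "No"])
left, right = parent[:4], parent[4:]  # an imagined binary split
print(information_gain(parent, [left, right]))  # about 0.459
```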
On a similar note, we can create multiple Decision Trees by randomly choosing features for each one.
Step 3: After creating multiple Decision Trees, we train each of them on its own bootstrap dataset. As you can see in the image below, each Decision Tree is trained with its corresponding bootstrap dataset.
Step 4: Once all the models have been trained, we perform testing. The Random Forest algorithm feeds the test dataset to every Decision Tree and asks each one to predict the output. Each Decision Tree predicts Yes or No.
Once all the Decision Trees have made their predictions, the Random Forest algorithm performs majority voting to give the final result as output. This, by and large, increases our accuracy.
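Putting Steps 1 to 4 together, here is a compact sketch of a hand-rolled Random Forest; the dataset is a stand-in, and a production implementation such as scikit-learn's handles far more detail:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Step 1: bootstrap dataset via row sampling with replacement
    rows = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2 and 3: grow a tree; max_features="sqrt" restricts every
    # split to a random subset of columns (the column-sampling part)
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[rows], y_train[rows]))

# Step 4: every tree predicts, and the majority vote is the final answer
votes = np.stack([t.predict(X_test) for t in trees])
final = (votes.mean(axis=0) > 0.5).astype(int)  # majority for 0/1 labels
print("Accuracy:", (final == y_test).mean())
```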
I hope this blog helped you become familiar with the Random Forest algorithm in machine learning. In our next blogs we will cover more such algorithms.