Car Evaluation using Machine Learning
Introduction
As the title suggests, in this blog we will carry out predictions on the Car Evaluation dataset using machine learning. We will use three different algorithms: Logistic Regression, Decision Tree and Random Forest. This is a classification problem, i.e. we need to classify each car as unacceptable, acceptable, good or very good based on a number of features. To follow along, you can download the dataset from the link below.
Car Evaluation dataset: https://drive.google.com/file/d/1pqL9NdCrX1_xJYLZqStzpMmyEYXq0ot4/view?usp=sharing
Also, if you have not gone through the blogs on the algorithms we are going to use here, we recommend doing that first. The links for each of them are given below.
Logistic Regression: https://ainewgeneration.com/logistic-regression/
Decision Tree: https://ainewgeneration.com/decision-tree-in-machine-learning/
Random Forest: https://ainewgeneration.com/random-forest-in-machine-learning/
Table of Contents
- Data Analysis and Preprocessing
- Car Evaluation prediction using Logistic Regression
- Car Evaluation prediction using Decision Tree
- Car Evaluation prediction using Random Forest
1. Data Analysis and Preprocessing
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,mean_squared_error
First things first. As in every project, our first task is to import all the libraries that are going to be used in our code.
df=pd.read_csv(r"D:\Project_Datasets\Car Evaluation\car_evaluation.csv")
After importing the necessary libraries, we read our dataset, which is a comma-separated values (CSV) file, and load it into a pandas DataFrame named df.
df.head()

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
df.columns = col_names
col_names
output:
['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
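As we can see, the columns of our dataset have not been named, so here we are naming our columns and then printing them. The output of this code is also shown above. Note that because the file has no header row, read_csv treats the first data row as the header, and assigning df.columns overwrites it. This is why the outputs below report 1727 rows and why one category in each column is one short of the others. Passing header=None to read_csv would keep that row.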
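Since the CSV file has no header row, one way to load it without losing the first record is to pass header=None together with names= to read_csv. The snippet below is a minimal sketch of that idea. The three in-memory rows and the names sample_csv and sample_df are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

# Three illustrative rows standing in for the headerless CSV file (assumption).
sample_csv = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "low,low,5more,more,big,high,vgood\n"
)

# header=None keeps the first line as data; names= supplies the column labels.
sample_df = pd.read_csv(sample_csv, header=None, names=col_names)

print(sample_df.shape)
print(sample_df.head())
```

With the real file, the same two arguments can be added to the pd.read_csv call and the separate df.columns assignment is then no longer needed.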
df.head()

df.describe()

df.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1727 entries, 0 to 1726
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 buying 1727 non-null object
1 maint 1727 non-null object
2 doors 1727 non-null object
3 persons 1727 non-null object
4 lug_boot 1727 non-null object
5 safety 1727 non-null object
6 class 1727 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
Here, we are using the info() function to know about each column of the dataset. Next, we will use the isnull().sum() function to see if there are any null values in the data as shown below.
df.isnull().sum()
output:
buying 0
maint 0
doors 0
persons 0
lug_boot 0
safety 0
class 0
dtype: int64
The following code segments and their outputs show how the values of each feature are distributed in our dataset.
df["buying"].value_counts()
output:
med 432
high 432
low 432
vhigh 431
Name: buying, dtype: int64
df["maint"].value_counts()
output:
med 432
high 432
low 432
vhigh 431
Name: maint, dtype: int64
df["doors"].value_counts()
output:
5more 432
4 432
3 432
2 431
Name: doors, dtype: int64
df["persons"].value_counts()
output:
more 576
4 576
2 575
Name: persons, dtype: int64
df["lug_boot"].value_counts()
output:
big 576
med 576
small 575
Name: lug_boot, dtype: int64
df["safety"].value_counts()
output:
med 576
high 576
low 575
Name: safety, dtype: int64
df["class"].value_counts()
output:
unacc 1209
acc 384
good 69
vgood 65
Name: class, dtype: int64
As we can see, all the features of our dataset are categorical in nature, so we will create dummy (one-hot) variables for the independent variables.
df = pd.get_dummies(df,columns=["buying",
"maint","doors","persons","lug_boot","safety"])
The class feature of our dataset is our target variable, so we are renaming it just for our own sake of understanding. You can skip this step if you want.
df=df.rename(columns={"class":"target"})
df.head()
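To see what the dummy encoding above does, here is a small sketch on a toy frame. The frame toy and its four rows are made up for illustration. Each category of the encoded column becomes its own indicator column, which is also why the six features of the real dataset (4 + 4 + 4 + 3 + 3 + 3 categories) expand into 21 columns.

```python
import pandas as pd

# Toy frame with one feature and the label column (illustrative data only).
toy = pd.DataFrame({
    "safety": ["low", "med", "high", "low"],
    "class": ["unacc", "acc", "vgood", "unacc"],
})

# Only the listed columns are expanded; other columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["safety"])

print(list(encoded.columns))
print(encoded)
```

Every row has exactly one of the safety_* indicators set, so no ordering between the categories is implied.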

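Next, we collect the names of our independent variables as shown below. Here, we are including all features except our target (or dependent) variable. This list is only for reference; the feature matrix X is built further down by dropping the target column directly.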
cat_cols=list(set(df.columns)-{"target"})
Since the target variable is also categorical, we use LabelEncoder to encode it as integers, and then separate the features X from the encoded target y.
le=LabelEncoder()
y=le.fit_transform(df["target"])
X = df.drop(["target"], axis=1)
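It helps to know which integer each class label is mapped to. The sketch below shows one way to inspect this on a short, made-up list of labels; le_demo, labels and mapping are illustrative names. LabelEncoder assigns codes in sorted order of the label strings, so the codes do not follow the quality order unacc, acc, good, vgood.

```python
from sklearn.preprocessing import LabelEncoder

# A few example labels using the four classes of the dataset.
labels = ["unacc", "acc", "good", "vgood", "unacc"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the original labels; the index of each label is its code.
mapping = {name: code for code, name in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into readable labels.
print(le_demo.inverse_transform(codes))
```

The same inverse_transform call can be applied to model predictions to report them as class names instead of integers.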
Then, just as we do for every machine learning project, we split the dataset into train and test sets, here in roughly a 67:33 ratio (test_size = 0.33).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)
After this, we check the shape, i.e. the number of rows and columns, of our train and test sets.
X_train.shape, X_test.shape
output:
((1157, 21), (570, 21))
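The classes are heavily imbalanced (unacc makes up about 70% of the rows), so a plain random split can leave the rare classes unevenly represented. One option, not used in this blog, is the stratify argument of train_test_split. The sketch below uses a synthetic label array with the class counts reported above; X_demo and y_demo are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the reported class counts (acc, good, unacc, vgood).
y_demo = np.array([0] * 384 + [1] * 69 + [2] * 1209 + [3] * 65)
# Dummy single-column feature matrix, just to have something to split.
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
# Class proportions in the test set closely follow those of the full data.
print(np.bincount(y_te) / len(y_te))
```

Whether stratification changes the scores reported in this blog has not been checked here; it mainly makes the evaluation less sensitive to the random seed.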
2. Car Evaluation prediction using Logistic Regression
Finally, it's time to build a model. First we will build a model using logistic regression.
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train, y_train)
output:
LogisticRegression()
After building a model, we will then predict the values of our test dataset.
test_pred=clf.predict(X_test)
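After the prediction, we calculate the mean squared error between the test labels and the predicted values. Keep in mind that the labels are integer codes assigned by LabelEncoder, so this number is not a misclassification rate and is hard to interpret for a classification problem.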
mean_squared_error(y_test, test_pred)
output:
0.37894736842105264
After doing everything, our last task involves calculating the accuracy of our results.
accuracy_score(y_test, test_pred)
output:
0.9017543859649123
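For a classification problem, a confusion matrix and per-class precision and recall are usually more informative than a mean squared error on label codes. The sketch below shows these metrics on two short, hand-written arrays; y_true and y_hat are made-up values, not the model's actual predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

class_names = ["acc", "good", "unacc", "vgood"]  # order used by LabelEncoder

# Hand-written example labels and predictions (illustrative only).
y_true = [2, 2, 0, 1, 3, 2, 0, 2]
y_hat = [2, 2, 0, 0, 3, 2, 2, 2]

acc = accuracy_score(y_true, y_hat)
error_rate = 1 - acc  # fraction of misclassified samples

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2, 3])

print(acc, error_rate)
print(cm)
print(classification_report(
    y_true, y_hat, labels=[0, 1, 2, 3], target_names=class_names, zero_division=0
))
```

Replacing y_true and y_hat with y_test and test_pred would give the same breakdown for the logistic regression model.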
3. Car Evaluation prediction using Decision Tree
After carrying out prediction using Logistic Regression, we move on to building a model using a Decision Tree. First we import the classifier and then fit it to our training data. Note that we are using the Gini index as the attribute selection measure and limiting the tree to a maximum depth of 5.
from sklearn.tree import DecisionTreeClassifier
max_depth = 5
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=max_depth, random_state=0)
# clf_gini = DecisionTreeClassifier(criterion='gini', random_state=0)
# fit the model
clf_gini.fit(X_train, y_train)
output:
DecisionTreeClassifier(max_depth=5, random_state=0)
After building a model, we will then predict the values of our test dataset.
y_pred_gini = clf_gini.predict(X_test)
After carrying out our prediction, we finally calculate the accuracy of our model using the following code:
print('Model accuracy score with criterion gini index: {0:0.4f}'.format(accuracy_score(y_test, y_pred_gini)))
output:
Model accuracy score with criterion gini index: 0.8404
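The Gini index used as the splitting criterion above measures how mixed the classes in a node are: it is 1 minus the sum of squared class proportions. The helper gini_impurity below is a hypothetical function written for illustration, not part of sklearn. It is applied to the class counts of the whole dataset as an example of what an unsplit root node would look like.

```python
import numpy as np


def gini_impurity(counts):
    """Return 1 - sum(p_k^2) for a node with the given class counts."""
    counts = np.asarray(counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


# Class counts from df["class"].value_counts(): unacc, acc, good, vgood.
root_gini = gini_impurity([1209, 384, 69, 65])
print(round(root_gini, 4))

# A pure node has impurity 0; an even two-class node has impurity 0.5.
print(gini_impurity([10, 0]), gini_impurity([5, 5]))
```

At each split the tree picks the feature and threshold that reduce the weighted impurity of the child nodes the most.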
4. Car Evaluation prediction using Random Forest
Next we move on to building a model using Random Forest. First we import the algorithm, which is part of sklearn's ensemble module. After this, we build the model on our training dataset. Note that this Random Forest model uses 1000 estimators. Then we carry out predictions as shown in the code below:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=1000, random_state=0)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
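After all this, we finally calculate the mean squared error of the predictions made by our model on the test dataset. As we can see in the output, it is about 0.14, lower than the 0.38 obtained with Logistic Regression. Since it is computed on label codes, it should not be read as a 14% error rate.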
mean_squared_error(y_test, y_pred)
output:
0.14210526315789473
Finally, we use the accuracy_score metric to judge our model on the basis of its accuracy. The accuracy comes out to about 96%, i.e. under 4% of the test samples are misclassified, which is the best result of the three models.
accuracy_score(y_test, y_pred)
output:
0.9631578947368421
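The three models can also be trained and scored in a single loop, which makes a side-by-side comparison easier. The sketch below shows the pattern on synthetic data from make_classification, which is a stand-in for the car dataset. The variable names, max_iter=1000 and the smaller forest of 100 trees are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the encoded car features.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.33, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_a, y_a)
    scores[name] = accuracy_score(y_b, model.predict(X_b))

for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```

Swapping in X_train, X_test, y_train and y_test from the sections above would reproduce the comparison on the car data.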
End Notes
We hope that this blog gave you a good idea of how to train models and carry out predictions using different types of machine learning algorithms. We will come up with more such insightful blogs in the future.