Logistic Regression in Machine Learning

  • Posted by Natasha
  • Date August 3, 2021

Introduction

Logistic Regression is a supervised machine learning algorithm that operates on a given set of independent variables to generate categorical results such as 0/1, yes/no, or true/false. The major difference between Logistic Regression and other classification algorithms is that it uses a special function known as the Sigmoid Function or Logistic Function, which other classification algorithms do not.

Logistic regression is primarily used in classification problems, more precisely binary classification problems, where it generates discrete output values to classify objects into two different classes.

Table of Contents

  1. Linear Regression versus Logistic Regression
  2. Logistic Function (Sigmoid Function)
  3. Logistic Equation
  4. Types of Logistic Regression
  5. How does it work?
  6. Advantage of using Logistic Regression
  7. Problems that can be solved
  8. Heart Disease classification using Logistic Regression

 

Linear Regression versus Logistic Regression

The major difference between Linear Regression and Logistic Regression is that while the former generates continuous outcomes, the latter gives us discrete categorical values as the result.

For example, linear regression is mainly used for predicting continuous values, such as house prices, while logistic regression is used for classification problems, such as cat vs. dog.
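
As a rough illustration of this difference, here is a minimal sketch on made-up toy data (the numbers and variable names are purely illustrative, not from this post): linear regression returns a continuous number, while logistic regression returns a class label.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy feature, e.g. house size in thousands of square feet (values are made up)
X = np.array([[0.5], [0.8], [1.1], [1.5], [2.0]])

# Continuous target (e.g. house price in thousands) -> linear regression
prices = np.array([50, 80, 110, 150, 200])
lin = LinearRegression().fit(X, prices)
print(lin.predict([[1.2]]))   # a continuous value, roughly 120

# Binary target (e.g. 0 = cat, 1 = dog) -> logistic regression
labels = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([[1.9]]))   # a discrete class label, 0 or 1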


Logistic Function (Sigmoid Function)

  • The Logistic Function, commonly called the Sigmoid Function, is an S-shaped curve that takes any real-valued number (continuous value) and maps it to a value between 0 and 1.
  • The function has two asymptotes, at y = 0 and y = 1. This is why it always gives us probability values in the range of 0 to 1.
  • A pre-defined threshold value is then used to turn this probability into a prediction of either 0 or 1.
  • The mathematical formula of the sigmoid function is f(x) = 1 / (1 + e^(-x)).

The picture below gives us detailed insight into this function. The blue-coloured S-shaped curve is our sigmoid function. The three vertical dotted lines represent the lower threshold value, x = 0, and the upper threshold value, respectively.

We can clearly see that at the lower and upper threshold values the graph tends to 0 and 1 respectively, thereby providing the desired probability values throughout the curve.

[Figure: the sigmoid (S-shaped) curve with threshold lines. Source: Google Images]
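
As a quick sanity check of this behaviour, here is a minimal NumPy sketch of the sigmoid function (the sample inputs are arbitrary):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: squashes any real number into the open range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10, -2, 0, 2, 10])
print(sigmoid(z))
# approx [0.00005 0.119 0.5 0.881 0.99995]: large negative inputs approach 0,
# large positive inputs approach 1, and z = 0 maps exactly to 0.5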

Logistic Equation

The logistic regression equation can be derived from the linear regression equation. The mathematical steps to obtain it are given below.

We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn

In logistic regression, y can only lie between 0 and 1, so we first form the odds by dividing y by (1 - y), which ranges from 0 to +infinity:

y / (1 - y)

But we need a range between -infinity and +infinity, so we take the logarithm of the odds, which gives the final logistic (log-odds) equation:

log[ y / (1 - y) ] = b0 + b1x1 + b2x2 + b3x3 + … + bnxn
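
The small NumPy check below (with arbitrary sample probabilities) shows that the odds range from 0 to +infinity, that the log-odds cover the whole real line, and that the sigmoid recovers the original probabilities:

import numpy as np

p = np.array([0.1, 0.5, 0.9])        # probabilities between 0 and 1
odds = p / (1 - p)                   # ranges from 0 to +infinity
log_odds = np.log(odds)              # ranges from -infinity to +infinity
print(log_odds)                      # approx [-2.197  0.     2.197]
print(1 / (1 + np.exp(-log_odds)))   # the sigmoid recovers [0.1 0.5 0.9]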

Types of Logistic Regression

On the basis of the categories of the dependent variable, logistic regression can be classified into three types (a small scikit-learn sketch of the first two types follows this list):

  • Binomial: In binomial or binary logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, true or false, etc.
  • Multinomial: In this, there can be 3 or more possible unordered types of the dependent variable, such as “car”, “truck”, or “plane”.
  • Ordinal: In this type of logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as “Primary”, “Secondary”, or “Senior Secondary”.
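
As a small sketch of the first two types (with made-up toy data, not from this post), scikit-learn's LogisticRegression handles binary targets directly and switches to a multinomial formulation when there are more than two classes; ordinal logistic regression needs a dedicated library and is not shown here.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Binomial: two possible classes (0 / 1)
y_binary = np.array([0, 0, 0, 1, 1, 1])
print(LogisticRegression().fit(X, y_binary).predict([[2.5]]))   # prints a 0/1 prediction

# Multinomial: three unordered classes
y_multi = np.array(["car", "car", "truck", "truck", "plane", "plane"])
print(LogisticRegression().fit(X, y_multi).predict([[5.5]]))    # prints one of the three class names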

How does it work?

Logistic Regression works by measuring the relationship between the dependent variable (the label provided by us) and one or more independent variables (our features), estimating probabilities using its underlying logistic function.

These probabilities are then transformed into binary values in order to actually make a prediction. The logistic (sigmoid) function produces values between 0 and 1, which are then converted into either 0 or 1 using a threshold classifier, thereby assigning each data point to its predicted class.
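
The sketch below (with made-up linear scores) illustrates this two-step idea: the sigmoid turns each score into a probability, and a 0.5 threshold turns each probability into a class label.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (b0 + b1*x1 + ... + bn*xn) for four samples
scores = np.array([-3.2, -0.4, 0.1, 2.7])

probabilities = sigmoid(scores)                     # values between 0 and 1
predictions = (probabilities >= 0.5).astype(int)    # threshold classifier at 0.5

print(probabilities)   # approx [0.039 0.401 0.525 0.937]
print(predictions)     # [0 0 1 1]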

Advantage of using Logistic Regression

The edge we get by using Logistic Regression is that it gives us a probability value, which helps us judge more reliably how likely a particular event is to take place. Logistic regression is mainly used for binary classification problems.
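
In scikit-learn, this probability is exposed through the predict_proba() method. The snippet below is a tiny sketch on made-up data (names are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.2]]))         # hard class label, 0 or 1
print(clf.predict_proba([[2.2]]))   # probabilities for class 0 and class 1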

Problems that can be solved

Logistic regression can be used to solve problems such as classifying:

  • emails as spam or not spam;
  • tumours as malignant or benign;
  • online transactions as fraudulent or not.

The list can be endlessly extended to any kind of binary classification.

Heart Disease classification using Logistic Regression

After covering the theoretical aspects of logistic regression, we will now move on to implementing it on a Heart Disease data set to classify whether a person will develop a heart attack or not, based on various features.

For better understanding and analysis, you can download the data set from the link given below.

https://www.kaggle.com/ronitf/heart-disease-uci

Our very first task is to import all the libraries that will be required to carry out this machine learning algorithm on the given data set.

import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,mean_squared_error

After importing the necessary libraries and modules, the next task is to read our data file and store it as a pandas DataFrame. We pass the file location path as the parameter to pd.read_csv().

df = pd.read_csv(r"C:\Users\SRKT\Desktop\Data\HeartAttack.csv",na_values = "?")

We can view the first five rows of our data frame by using the head() function.

df.head()

Now, we will view summary statistics of the data frame using the describe() function, as shown below.

df.describe()

The info() method gives information about the number of non-null values as well as the data types and memory usage.

df.info()

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         294 non-null    int64  
 1   sex         294 non-null    int64  
 2   cp          294 non-null    int64  
 3   trestbps    293 non-null    float64
 4   chol        271 non-null    float64
 5   fbs         286 non-null    float64
 6   restecg     293 non-null    float64
 7   thalach     293 non-null    float64
 8   exang       293 non-null    float64
 9   oldpeak     294 non-null    float64
 10  slope       104 non-null    float64
 11  ca          3 non-null      float64
 12  thal        28 non-null     float64
 13  num         294 non-null    int64  
dtypes: float64(10), int64(4)
memory usage: 32.3 KB

Now, we will use the isnull().sum() method to count the number of null values in each column. We do this to better analyse our data frame before cleaning it.

df.isnull().sum()

output:
age             0
sex             0
cp              0
trestbps        1
chol           23
fbs             8
restecg         1
thalach         1
exang           1
oldpeak         0
slope         190
ca            291
thal          266
num             0
dtype: int64

As we can clearly see, slope, ca and thal are the columns that have most of their values missing, so it is better to remove these columns if we want to make our prediction model more accurate.

df = df.drop(columns=["slope","ca","thal"],axis =1)
df.head()

In addition to this, we can also remove all the rows where there are null values in any column. This will help us generate even better results.

df = df.dropna()
df.head()

After carrying out the necessary data-cleaning steps, we will again use the info() and isnull().sum() functions. The outputs now obtained reflect the cleaning steps we carried out above.

df.info()

output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 261 entries, 0 to 293
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         261 non-null    int64  
 1   sex         261 non-null    int64  
 2   cp          261 non-null    int64  
 3   trestbps    261 non-null    float64
 4   chol        261 non-null    float64
 5   fbs         261 non-null    float64
 6   restecg     261 non-null    float64
 7   thalach     261 non-null    float64
 8   exang       261 non-null    float64
 9   oldpeak     261 non-null    float64
 10  num         261 non-null    int64  
dtypes: float64(7), int64(4)
memory usage: 24.5 KB
df.isnull().sum()

output:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
num         0
dtype: int64

We will now call the value_counts() function on various features to differentiate between categorical and numerical features.

df["fbs"].value_counts()

output:
0.0    242
1.0     19
Name: fbs, dtype: int64
df["sex"].value_counts()

output:
1    192
0     69
Name: sex, dtype: int64
df["exang"].value_counts()

output:

0.0    178
1.0     83
Name: exang, dtype: int64
df["cp"].value_counts()

output:
4    113
2     92
3     46
1     10
Name: cp, dtype: int64
df["restecg"].value_counts()

output:
0.0    208
1.0     47
2.0      6
Name: restecg, dtype: int64

From the above outputs we can see that age, trestbps, chol, thalach and oldpeak are numerical variables, while sex, fbs, exang, cp and restecg are categorical variables.

While sex, fbs and exang are already in binary format, which is suitable for categorical features, cp and restecg are not. As these features have more than two categories, leaving them as integers may lead to incorrect predictions, since the model may treat the higher values as more important. Thus, we will carry out one-hot encoding to prevent this problem.

To do this, we create dummy variables for each categorical feature where this problem is expected. The code for this is given below:

df = pd.get_dummies(df,columns=["cp","restecg"])

After implementing this, the data frame appears as shown below:

df.head()

In order to know the exact name and index of the columns as present in the data frame, we will use the following code:

df.columns

output:
Index(['age', 'sex', 'trestbps', 'chol', 'fbs', 'thalach', 'exang', 'oldpeak',
       'num       ', 'cp_1', 'cp_2', 'cp_3', 'cp_4', 'restecg_0.0',
       'restecg_1.0', 'restecg_2.0'],
      dtype='object')

We will now identify and rename our dependent variable. In this data set, whether the person will have a heart attack or not is the dependent variable (the num column, whose name contains trailing spaces in the original file). We will rename it to target.

df=df.rename(columns={"num       ":"target"})  

We will now define our independent variables. There are two kinds of independent variables here: numerical and categorical. The code below defines these separately, prints the first five rows of the data frame, and then prints both lists of features.

num_cols=["age","trestbps","chol","thalach","oldpeak"]
cat_cols=list(set(df.columns)- set(num_cols)-{"target"})
df.head()
num_cols

output:
['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
cat_cols

output:
['exang',
 'restecg_1.0',
 'cp_1',
 'fbs',
 'cp_2',
 'sex',
 'cp_3',
 'restecg_2.0',
 'restecg_0.0',
 'cp_4']

After defining the features, we will split our data into training and testing sets in an 80:20 ratio.

df_train, df_test=train_test_split(df, test_size=0.2, random_state=41)

The code below prints the length of training and testing data.

len(df_train), len(df_test)

output:
(208, 53)

As we can see, our numerical and categorical data are not in the same range. So, for better prediction results, we scale our numerical features into a similar range. For this, we are using StandardScaler.

After scaling the numerical features, we merge them with the categorical features into a single NumPy array using the np.hstack() function.

The code below defines a function to do this. The scaler should be fitted on the training data only and then reused on the test data, so the function takes a fit_scaler flag.

scaler = StandardScaler()

def get_features_and_target_arrays(df, num_cols, cat_cols, scaler, fit_scaler=False):
    # Fit the scaler only on the training data; reuse it (transform only) elsewhere
    if fit_scaler:
        x_numeric_scaled = scaler.fit_transform(df[num_cols])
    else:
        x_numeric_scaled = scaler.transform(df[num_cols])
    # The categorical (binary / one-hot encoded) columns are used as-is
    x_categorical = df[cat_cols].to_numpy()
    # Stack the categorical and scaled numerical features side by side
    x = np.hstack((x_categorical, x_numeric_scaled))
    y = df["target"]

    return x, y

Now that the function is defined, we call it for our training data, fitting the scaler on the training set.

x_train, y_train = get_features_and_target_arrays(df_train, num_cols, cat_cols, scaler, fit_scaler=True)

After all the data preprocessing, we now move on to the most important step: creating and fitting a Logistic Regression model on the training data.

clf= LogisticRegression()
clf.fit(x_train, y_train)

output:
LogisticRegression()

Data preprocessing needs to be performed on the testing data as well. The code below passes the test set to the function we defined earlier, this time reusing the scaler that was already fitted on the training data.

x_test, y_test = get_features_and_target_arrays(df_test, num_cols, cat_cols, scaler)

Now that the model is trained, we will predict values using it. The predictions are stored in a variable called test_pred, as shown below.

test_pred=clf.predict(x_test)

Having run our logistic regression algorithm and made predictions with it, it's now time to analyse the error and accuracy of our model. The code below shows this.

mean_squared_error(y_test, test_pred)

output:
0.16981132075471697
accuracy_score(y_test, test_pred)

output:
0.8301886792452831

As we can see from the outputs, our model has an accuracy of about 83%, which is reasonably good.
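
Beyond accuracy, it can also be worth inspecting a confusion matrix and per-class precision and recall. The short sketch below reuses y_test and test_pred from above; it uses scikit-learn's confusion_matrix and classification_report, which are not part of the original walkthrough.

from sklearn.metrics import confusion_matrix, classification_report

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, test_pred))

# Precision, recall and F1-score for each class
print(classification_report(y_test, test_pred))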

End Notes

We hope that this blog gave you a good overview of Logistic Regression. It covers both the theory and a hands-on implementation in code.
