# Logistic Regression in Machine Learning

## Introduction

Logistic Regression is a supervised machine learning algorithm that operates on a given set of independent variables to generate categorical results such as 0/1, yes/no, or true/false. What characterizes Logistic Regression is its use of a special function, known as the Sigmoid Function or the Logistic Function, to turn a linear combination of the inputs into a probability.

Logistic regression is primarily used in classification problems, more precisely binary classification problems, where it generates discrete output values to classify objects into two different classes.

**Table of Contents**

- Linear Regression versus Logistic Regression
- Logistic Function (Sigmoid Function)
- Logistic Equation
- Types of Logistic Regression
- How does it work?
- Advantage of using Logistic Regression
- Problems that can be solved

## Linear Regression versus Logistic Regression

The major difference between Linear Regression and Logistic Regression is that the former generates continuous outcomes, while the latter gives us discrete categorical values as the result.

For example, linear regression is mainly used to predict a continuous value, such as a house price, whereas logistic regression is used for classification problems, such as telling cats from dogs.


## Logistic Function (Sigmoid Function)

- The Logistic Function, or as it is commonly called the Sigmoid function, is an S-shaped curve that takes any real-valued number (continuous values) and maps it to a value between 0 and 1.
- The function has two asymptotes, at y = 0 and y = 1. This is why it always gives us probability values in the range of 0 to 1.
- A pre-defined threshold value is then used to turn this probability into a class: values above the threshold are assigned to class 1 and values below it to class 0.
- The mathematical formula of the sigmoid function is $\sigma(x) = \frac{1}{1 + e^{-x}}$.

The picture below illustrates this function. The blue S-shaped curve is our sigmoid function. The three vertical dotted lines represent the lower threshold value, x = 0, and the upper threshold value respectively.

We can clearly see that towards the lower and the upper threshold values the graph tends to 0 and 1 respectively, thereby providing us our desired probability values throughout the curve.
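The behaviour described above can be checked numerically. The following is a minimal sketch that assumes NumPy is available; the input values are chosen purely for illustration and are not taken from the figure.

```python
import numpy as np


def sigmoid(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Illustrative inputs (assumed here, not read off the article's figure)
xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
ys = sigmoid(xs)

for x, y in zip(xs, ys):
    print(f"sigmoid({x:+.1f}) = {y:.5f}")
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and an input of 0 gives exactly 0.5.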


## Logistic Equation

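The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get it are below:

We know the equation of the straight line can be written as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$

In logistic regression, y can be between 0 and 1 only, so let’s divide y by (1 - y):

$$\frac{y}{1-y}, \quad \text{which is } 0 \text{ for } y = 0 \text{ and tends to infinity as } y \to 1$$

But we need a range from $-\infty$ to $+\infty$, so we take the logarithm of this ratio, and the equation becomes:

$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$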
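As a quick sanity check of these steps, the sketch below shows that taking the logarithm of y / (1 - y) recovers the linear part of the model. The coefficients `b0`, `b1` and the feature value `x1` are hypothetical numbers picked for illustration.

```python
import numpy as np


def sigmoid(z):
    """Map a real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logit(y):
    """Log-odds: log(y / (1 - y)), defined for 0 < y < 1."""
    return np.log(y / (1.0 - y))


# Hypothetical coefficients and a single feature value (illustration only)
b0, b1 = -1.0, 0.5
x1 = 3.0

z = b0 + b1 * x1        # linear part, can be any real number
y = sigmoid(z)          # probability between 0 and 1
recovered = logit(y)    # the log-odds give back the linear part

print(z, y, recovered)
```

In other words, the sigmoid and the log-odds are inverses of each other, which is why the model stays linear in the features on the log-odds scale.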

## Types of Logistic Regression

On the basis of the categories, it can be classified into three types:

- **Binomial:** In binomial or binary Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, true or false, etc.
- **Multinomial:** In this, there can be 3 or more possible unordered types of the dependent variable, such as “car”, “truck”, or “plane”.
- **Ordinal:** In this type of Logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as “Primary”, “Secondary”, or “Senior Secondary”.

## How does it work?

Logistic Regression works by measuring the relationship between the dependent variable (the label provided by us) and one or more independent variables (our features), by estimating probabilities using its underlying logistic function.

Producing these probabilities is the task of the logistic function, also called the sigmoid function. To actually make a prediction, the values between 0 and 1 are then transformed into either 0 or 1 using a threshold classifier, thereby classifying the data points into their respective predicted classes.
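The thresholding step can be sketched in a few lines. The probabilities and the cut-off of 0.5 below are assumptions made for illustration; they do not come from the heart disease data used later.

```python
import numpy as np

# Hypothetical predicted probabilities of belonging to class 1
probabilities = np.array([0.05, 0.40, 0.50, 0.73, 0.98])

# Assumed cut-off; other thresholds can be chosen depending on the problem
threshold = 0.5

# Probabilities at or above the threshold become class 1, the rest class 0
predicted_classes = (probabilities >= threshold).astype(int)
print(predicted_classes)
```

Raising the threshold makes the classifier more conservative about predicting class 1, while lowering it does the opposite.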

## Advantage of using Logistic Regression

The advantage of Logistic Regression is that it gives us a probability value, which helps us judge how likely a particular event is to take place rather than only which class it falls into. We mainly use logistic regression for binary classification problems.
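To make the point about probabilities concrete, here is a small sketch using scikit-learn's `LogisticRegression` on made-up one-feature data. The data and the query points are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up): small feature values are class 0, large ones class 1
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

new_points = np.array([[1.0], [2.5], [4.0]])
proba = model.predict_proba(new_points)  # columns: P(class 0), P(class 1)
labels = model.predict(new_points)       # hard 0/1 decisions

print(proba)
print(labels)
```

`predict_proba` exposes how confident the model is about each point, whereas `predict` returns only the final class.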

## Problems that can be solved

The algorithm followed by Logistic regression can be used to solve problems such as classifying

- emails as spam or not spam;
- tumors as malignant or benign;
- online transactions as fraudulent or not.

The list can be extended to any kind of binary classification.

## Heart Disease classification using Logistic Regression

After covering the theoretical aspects of logistic regression, we will now implement it on a Heart Disease data set to classify whether or not a person is likely to have a heart attack, based on various features.

For better understanding and analysis, you can download the data set from the link given below.

https://www.kaggle.com/ronitf/heart-disease-uci

Our very first task is to import all necessary libraries which will be required to carry out this machine learning algorithm on the given data set.

```
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,mean_squared_error
```

After importing the necessary libraries and modules, the next task is to read our data file and store it as a pandas data frame. We pass the file location path as the parameter to the function, and tell pandas to treat "?" entries as missing values.

`df = pd.read_csv(r"C:\Users\SRKT\Desktop\Data\HeartAttack.csv",na_values = "?")`

We can view the first five rows of our data frame by using the head() function.

`df.head()`

Now, we will view summary statistics of the data frame by using the describe() function as shown below.

`df.describe()`

The info() method gives information about the number of non-null values as well as the data types and memory usage.

```
df.info()
```

**output:**

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294 entries, 0 to 293
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       294 non-null    int64
 1   sex       294 non-null    int64
 2   cp        294 non-null    int64
 3   trestbps  293 non-null    float64
 4   chol      271 non-null    float64
 5   fbs       286 non-null    float64
 6   restecg   293 non-null    float64
 7   thalach   293 non-null    float64
 8   exang     293 non-null    float64
 9   oldpeak   294 non-null    float64
 10  slope     104 non-null    float64
 11  ca        3 non-null      float64
 12  thal      28 non-null     float64
 13  num       294 non-null    int64
dtypes: float64(10), int64(4)
memory usage: 32.3 KB
```

Now, we will use the isnull().sum() method to know the number of null values for each column. We are doing this to carry out better analysis of our data frame.

```
df.isnull().sum()
```

**output:**

```
age           0
sex           0
cp            0
trestbps      1
chol         23
fbs           8
restecg       1
thalach       1
exang         1
oldpeak       0
slope       190
ca          291
thal        266
num           0
dtype: int64
```

As we can clearly see, slope, ca and thal have most of their values as null, so it is better to remove these columns if we want to make our prediction model more accurate.

```
# Drop the columns that are mostly null
df = df.drop(columns=["slope", "ca", "thal"])
df.head()
```

In addition to this, we can also remove every row that still has a null value in any column, so that the model is trained only on complete records.

```
df = df.dropna()
df.head()
```
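The two cleaning steps can be tried out on a small made-up frame. In the sketch below, the toy data and the 50% cut-off for calling a column "mostly empty" are assumptions for illustration; the article itself picks the columns by inspecting the null counts.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with one mostly-empty column and one scattered gap
toy = pd.DataFrame({
    "age": [40, 52, 61, 47, 58],
    "chol": [210.0, np.nan, 250.0, 190.0, 230.0],
    "ca": [np.nan, np.nan, np.nan, 0.0, np.nan],
})

# Share of null values per column
null_fraction = toy.isnull().mean()

# Assumed rule: a column with more than half of its values missing is dropped
mostly_empty = null_fraction[null_fraction > 0.5].index.tolist()

cleaned = toy.drop(columns=mostly_empty)  # remove mostly-empty columns
cleaned = cleaned.dropna()                # then remove rows with any null

print(mostly_empty)
print(cleaned.shape)
```

Dropping the sparse columns first matters: calling `dropna()` before that would discard almost every row, because nearly all rows have a gap in the sparse column.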

After carrying out these data cleaning steps, we will again use the info() and isnull().sum() functions. The outputs now reflect the changes made so far: three columns have been dropped and no null values remain.

```
df.info()
```

**output:**

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 261 entries, 0 to 293
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       261 non-null    int64
 1   sex       261 non-null    int64
 2   cp        261 non-null    int64
 3   trestbps  261 non-null    float64
 4   chol      261 non-null    float64
 5   fbs       261 non-null    float64
 6   restecg   261 non-null    float64
 7   thalach   261 non-null    float64
 8   exang     261 non-null    float64
 9   oldpeak   261 non-null    float64
 10  num       261 non-null    int64
dtypes: float64(7), int64(4)
memory usage: 24.5 KB
```

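```
df.isnull().sum()
```

**output:**

```
age            0
sex            0
trestbps       0
chol           0
fbs            0
thalach        0
exang          0
oldpeak        0
target         0
cp_1           0
cp_2           0
cp_3           0
cp_4           0
restecg_0.0    0
restecg_1.0    0
restecg_2.0    0
dtype: int64
```

Note that the column names in this output (target, cp_1, restecg_0.0 and so on) reflect the renaming and one hot encoding steps described further below; what matters here is that every count is 0.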

We will now call value_counts() on several features to tell the categorical features apart from the numerical ones.

```
df["fbs"].value_counts()
```

**output:**

```
0.0    242
1.0     19
Name: fbs, dtype: int64
```

```
df["sex"].value_counts()
```

**output:**

```
1    192
0     69
Name: sex, dtype: int64
```

```
df["exang"].value_counts()
```

**output:**

```
0.0    178
1.0     83
Name: exang, dtype: int64
```

```
df["cp"].value_counts()
```

**output:**

```
4    113
2     92
3     46
1     10
Name: cp, dtype: int64
```

```
df["restecg"].value_counts()
```

**output:**

```
0.0    208
1.0     47
2.0      6
Name: restecg, dtype: int64
```

From the outputs generated above we can see that age, trestbps, chol, thalach and oldpeak are numerical variables, while sex, fbs, exang, cp and restecg are categorical variables.

While sex, fbs and exang are already in binary format, which is suitable for categorical features, cp and restecg are not. As there are more than two categories in these features, leaving them as they are may lead to incorrect predictions, because the model may treat a higher value as being of more importance. Thus, we will have to carry out one hot encoding in order to prevent this problem.

To do this, we will consider making dummy variables for each such categorical feature in which such a problem is expected. The code for the same is given below:

`df = pd.get_dummies(df,columns=["cp","restecg"])`

After implementing this, the data frame appears as shown below:

`df.head()`
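To see what one hot encoding does to such columns, here is a small sketch on a made-up frame. The column names mirror the ones used above, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame (made up) with two multi-category columns
toy = pd.DataFrame({
    "age": [40, 52, 61, 47],
    "cp": [1, 4, 2, 4],
    "restecg": [0.0, 1.0, 0.0, 2.0],
})

# Each distinct value of "cp" and "restecg" becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["cp", "restecg"])

print(encoded.columns.tolist())
print(encoded)
```

Only the categories that actually occur get a column, and the new column names are built from the original column name plus the category value, which is why names such as `restecg_0.0` keep the decimal point of the float values.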

In order to know the exact name and index of the columns as present in the data frame, we will use the following code:

```
df.columns
```

**output:**

```
Index(['age', 'sex', 'trestbps', 'chol', 'fbs', 'thalach', 'exang', 'oldpeak',
       'num ', 'cp_1', 'cp_2', 'cp_3', 'cp_4', 'restecg_0.0',
       'restecg_1.0', 'restecg_2.0'],
      dtype='object')
```

We will now identify and rename our dependent variable. In this data set, whether or not the person will have a heart attack is the dependent variable (the num column in the given data set; note that its name as listed above carries trailing whitespace, which the rename must match exactly). We will rename it to target.

`df=df.rename(columns={"num ":"target"}) `

We will now define our independent variables. There are two kinds of independent variables here, numerical and categorical. The code below defines these separately, prints the first five rows of the data frame, and then prints the two lists of features.

```
num_cols=["age","trestbps","chol","thalach","oldpeak"]
cat_cols=list(set(df.columns)- set(num_cols)-{"target"})
df.head()
```

```
num_cols
```

**output:**

```
['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
```

```
cat_cols
```

**output:**

```
['exang',
 'restecg_1.0',
 'cp_1',
 'fbs',
 'cp_2',
 'sex',
 'cp_3',
 'restecg_2.0',
 'restecg_0.0',
 'cp_4']
```

After defining the features, we will split our data into training and testing sets, containing 80% and 20% of the data respectively.

`df_train, df_test=train_test_split(df, test_size=0.2, random_state=41)`

The code below prints the length of training and testing data.

```
len(df_train), len(df_test)
```

**output:**

```
(208, 53)
```
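The split sizes can be reproduced on any frame with 261 rows. The sketch below uses a made-up frame of that length (the column names are placeholders) to show where the 208 and 53 come from.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with the same number of rows as the cleaned data set
toy = pd.DataFrame({
    "feature": np.arange(261),
    "target": np.arange(261) % 2,
})

# 20% of 261 is 52.2; scikit-learn rounds the test size up to 53
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=41)

print(len(toy_train), len(toy_test))
```

Fixing `random_state` makes the split reproducible, so the same rows end up in the training and testing sets on every run.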

As we can see, our numerical and categorical features are not in the same range. For better prediction results, we can therefore scale the numerical features to a similar range. For this, we use StandardScaler.

After scaling the numerical features, we merge them with the categorical features into a single numpy array by stacking them horizontally with the np.hstack() function.

The code below defines a function to do this.

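```
scaler = StandardScaler()

# Parameter order matches the calls that follow: categorical columns first,
# then numerical columns, so that only the numerical features are scaled.
def get_features_and_target_arrays(df, cat_cols, num_cols, scaler):
    # Standardize the numerical features (zero mean, unit variance)
    x_numeric_scaled = scaler.fit_transform(df[num_cols])
    # Categorical features are already 0/1; cast to float for stacking
    x_categorical = df[cat_cols].to_numpy(dtype=float)
    # Stack categorical and scaled numerical features side by side
    x = np.hstack((x_categorical, x_numeric_scaled))
    y = df["target"]
    return x, y
```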

Now that the function is defined, we can call it on our training data.

`x_train, y_train= get_features_and_target_arrays(df_train, cat_cols, num_cols, scaler)`

After all the data preprocessing, we will now move onto the most important step which is fitting and creating a Logistic regression model based on the training data.

```
clf= LogisticRegression()
clf.fit(x_train, y_train)
```

**output:**

```
LogisticRegression()
```

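Data preprocessing needs to be performed on the testing data as well. The code below carries it out by passing the testing data to the function we defined earlier. Note that the function calls fit_transform, so the scaler is refitted on the testing data; a stricter approach is to reuse the mean and standard deviation learned from the training data by calling scaler.transform instead.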

`x_test, y_test=get_features_and_target_arrays(df_test, cat_cols, num_cols, scaler)`
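A common variation on this step is to fit the scaler once on the training data and only apply it to the testing data. The sketch below illustrates that pattern on randomly generated numbers; the arrays and their distribution are assumptions, not the heart disease features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-ins for one numerical feature (illustration only)
rng = np.random.default_rng(0)
train_numeric = rng.normal(loc=130.0, scale=15.0, size=(200, 1))
test_numeric = rng.normal(loc=130.0, scale=15.0, size=(50, 1))

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only
train_scaled = scaler.fit_transform(train_numeric)

# Reuse those statistics on the testing data without refitting
test_scaled = scaler.transform(test_numeric)

print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

With this pattern the testing data is scaled exactly as unseen data would be at prediction time, so no information from the test set influences the preprocessing.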

As our model is already made, we will now predict values using the model we have trained. The predictions are stored in a variable called test_pred as shown below.

`test_pred=clf.predict(x_test)`

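Now that we have trained our logistic regression model and made predictions with it, it's time to analyze the error and the accuracy of our model. The code below shows this. Because the labels here are 0 and 1, the mean squared error is simply the fraction of misclassified samples, that is, 1 minus the accuracy.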

```
mean_squared_error(y_test, test_pred)
```

**output:**

```
0.16981132075471697
```

```
accuracy_score(y_test, test_pred)
```

**output:**

```
0.8301886792452831
```

As we can see from the outputs, our model has an accuracy of about 83% on the 53 test samples (44 of them classified correctly), which is a reasonable result for such a simple model.
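The relationship between the two reported numbers can be verified on a small made-up example. The label arrays below are invented for illustration and are unrelated to the actual test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up true labels and predictions (two of the ten predictions are wrong)
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 1])

acc = accuracy_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# For 0/1 labels every wrong prediction contributes a squared error of 1,
# so the mean squared error equals the share of misclassified samples
print(acc, mse, acc + mse)
```

This is why the two outputs above add up to 1: for binary labels, the mean squared error carries the same information as the accuracy.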

## End Notes

We hope that this blog gave you a good overall understanding of Logistic Regression. It covers both the theory and a hands-on implementation using code.
