In this blog, we will work hands-on on a machine learning project: diabetes prediction using Logistic Regression. Before starting, you should understand the concept of logistic regression: https://ainewgeneration.com/logistic-regression/ After going through the Logistic Regression blog, you will naturally want to see how the algorithm actually works with real-world data. To answer that question, we will walk through the application of Logistic Regression on a dataset in detail.
We will use NumPy, pandas, seaborn and matplotlib to carry out the exploratory data analysis, and scikit-learn to implement logistic regression. So, the first step is to import these libraries.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
After importing the required libraries and modules, the next step is to read the raw dataset, which is done with the read_csv() function. Note that the result should be assigned to a variable so that the variable holds the dataset and can conveniently be used whenever the dataframe is needed later.
df = pd.read_csv(r"C:\Learning\python_class\Logistic\diabetes.csv")
Once the above step has executed properly, we use the head() function. This function returns the first five rows of our data by default. It is not restricted to just five rows; we can pass any number of rows we would like to view. For instance, head(20) prints the first 20 rows of our dataframe.
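As a quick illustration of head(), here is a minimal sketch using a small toy dataframe standing in for the diabetes data (the column names and values are only for demonstration):

```python
import pandas as pd

# Toy frame standing in for the diabetes dataset (illustrative values only)
df_demo = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
})

print(df_demo.head())    # first five rows by default
print(df_demo.head(2))   # or pass the number of rows you want
```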
Now, before embarking on the core modelling part, we will check the data types. For this, we use the dtypes attribute of the dataframe.
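A minimal sketch of this check, again on a toy dataframe (the Class column mimics the qualitative target discussed next):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Class": ["YES", "NO", "YES"],   # qualitative column, shows up as object dtype
})

print(df_demo.dtypes)  # numeric columns are int64/float64, text columns are object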
A machine learning model is pure mathematics; it understands only numbers, not anything qualitative in nature. As you can see from the step above, the variable Class holds qualitative data. Class also happens to be the target variable (the dependent variable we need to predict), so it needs to be converted into quantitative form, which can be done using dummy variables. Whenever a categorical variable has more than two categories, it can be represented by a set of dummy variables.
df = pd.get_dummies(df, columns=["Class"], drop_first=True)
df.head()
At this step, we check for null values in our dataset. Blank or null entries pose a challenge that ultimately impacts model accuracy and performance. We need to either remove such null values or fill them with the mean; this judgement is made by seeing how many null values a variable actually has relative to the total data.
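The two options described above can be sketched as follows; this is a minimal illustration on a toy dataframe with deliberately inserted nulls, not the actual diabetes data:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "Glucose": [148, np.nan, 183, 89],
    "BMI": [33.6, 26.6, np.nan, 28.1],
})

print(df_demo.isnull().sum())          # null count per column

# Option 1: fill nulls with the column mean
df_filled = df_demo.fillna(df_demo.mean())

# Option 2: drop the rows containing nulls
df_dropped = df_demo.dropna()

print(df_filled.isnull().sum())        # no nulls remain after filling
```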
Before going further to design the predictive model, we need to complete a small step of defining our independent and dependent variables, or features. This is accomplished using the iloc indexer, which conveniently selects rows and columns from the DataFrame by their integer positions.
x_feature = df.iloc[:, 0:7]   # the first seven columns as independent variables
y_target = df.iloc[:, -1]     # the dummy-encoded Class column as the target
x_feature
One more step is needed so that we can bring all the independent variables onto a fixed scale or range. You may have noticed that these independent variables span wide ranges of values. Feature scaling is the technique that standardizes the independent features in the data to a fixed range. If feature scaling is not done, a machine learning algorithm tends to weigh greater values higher and treat smaller values as lower, regardless of the unit of the values.
sc = StandardScaler()
df_scale = sc.fit_transform(x_feature)
df_scale
With all the pre-processing steps complete, we are ready to split the data into train and test sets: the train set on which our model will be trained, and the test set on which predictions will be carried out. Here, the data is split in an 80:20 ratio.
x_train, x_test, y_train, y_test = train_test_split(df_scale, y_target, test_size=0.2)
len(x_train) , len(x_test)
After splitting the data, the most important step follows: training the model. For this, the LogisticRegression() class is used, and its fit() method finds the best fit for our model.
lr = LogisticRegression()
lr.fit(x_train, y_train)
test_pred = lr.predict(x_test) test_pred
After fitting the LogisticRegression model, it needs to be evaluated by finding the accuracy score and error.
from sklearn.metrics import accuracy_score, mean_squared_error

AccuracyScore = accuracy_score(y_test, test_pred)
Error = mean_squared_error(y_test, test_pred)
The accuracy score stands at 81%, which is decent enough to suggest that our model's predictions are good. Accuracy is a simple way of measuring the effectiveness of your model, but do keep exploring other methodologies to evaluate it; sometimes the accuracy score doesn't tell the whole story.
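Two such methodologies from scikit-learn are the confusion matrix and the classification report, which break predictions down by class. A minimal sketch, using stand-in label arrays rather than the actual y_test and test_pred from above:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Stand-ins for y_test and test_pred, purely illustrative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_hat))

# Per-class precision, recall and F1-score
print(classification_report(y_true, y_hat))
```

Unlike a single accuracy number, these reveal whether the model fails more often on the diabetic or the non-diabetic class.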
So, with this blog we complete our walkthrough of how the Logistic Regression algorithm can be executed in machine learning. I hope this blog gave you an overview of the technical details that come into play for logistic regression.