In our previous blog, we covered linear regression in detail: the approach behind it, its loss functions, its uses and many other aspects. In this blog, we will predict house prices using that algorithm. If you haven't read that post yet, I would highly recommend starting there, since the concepts discussed in it are used throughout this one. You can find the link to that blog here.
So, now let's start with our implementation of linear regression on the house price prediction data set.
For a better understanding and your own analysis, you can download the entire data set we are working with from the link below.
House Price Prediction
Here, we will use numpy, pandas, seaborn and matplotlib for the exploratory data analysis, and scikit-learn to implement linear regression. So, the first step is to import all these libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
After importing all the necessary libraries and modules, we move to the next step, which is of course reading our data set. As our data set is in CSV format, we use the read_csv() function, passing the path where the file is located as the argument. This path will vary from person to person depending on where the file sits on their PC. As you can see in the code, the data is stored in a DataFrame named House_df. Next, we use the head() function, which returns the first five rows of our data by default; if we pass a number, say head(10), it prints the first 10 rows.
House_df = pd.read_csv(r"C:\Learning\python_class\USA_Housing.csv")  # raw string so the backslashes are not treated as escape sequences
House_df.head()
After this, we will analyze our data in order to decide which operations are necessary. For this, we use the describe() function, which summarizes the data with various statistics: the count of all the rows, mean, standard deviation, minimum and maximum values, first quartile, third quartile, etc., as you can see in the output as well.
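The call itself is a one-liner. A minimal, self-contained sketch (using a tiny made-up frame, since the housing CSV sits on a local path; on the real data the call is simply House_df.describe()):

```python
import pandas as pd

# Tiny illustrative frame standing in for House_df
df = pd.DataFrame({"Price": [100.0, 200.0, 300.0, 400.0]})

# describe() returns count, mean, std, min, quartiles and max per numeric column
summary = df.describe()
print(summary)
```

The result is itself a DataFrame, so individual statistics can be pulled out, e.g. summary.loc["mean", "Price"].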
Next, we check for null values in our data set. If there are blank or null entries, we need to either remove those rows or fill them with mean values. Fortunately, our data set does not have any null values, as you can see below.
House_df.isnull().sum()

output:
Avg. Area Income                0
Avg. Area House Age             0
Avg. Area Number of Rooms       0
Avg. Area Number of Bedrooms    0
Area Population                 0
Price                           0
Address                         0
dtype: int64
Our next task in analyzing the data is to find the relations between the different features, and their relation with our dependent variable, Price, which is the value to be predicted. To do this, we will draw plots that let us analyze these relations more efficiently.
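One common plot for this is a correlation heatmap of the numeric columns. A minimal sketch with a small synthetic frame (on the real data you would pass House_df directly):

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Synthetic stand-in for the housing frame
df = pd.DataFrame({
    "income": [60.0, 70.0, 80.0, 90.0],
    "rooms":  [5.0, 6.0, 7.0, 8.0],
    "price":  [200.0, 240.0, 290.0, 330.0],
})

corr = df.corr(numeric_only=True)  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```

Cells close to 1 or -1 flag strongly related feature pairs; the price row shows which features are worth keeping as predictors.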
Now, we define our independent variables, the features that govern our dependent variable, price. Again, the head() function is used to print the first five rows of these features. One thing to note here is the use of the iloc[] indexer and the arguments passed to it: it selects rows and columns by integer position. The column slice :-2 means we exclude the last two columns from our independent-variable DataFrame (x here). Thus, the last two columns, Price and Address, have not been included. Price is excluded because it is our dependent variable and will be stored separately, while Address is not very closely related to determining the price, so we drop it.
x = House_df.iloc[:, :-2]
x.head()
As described, next we define our dependent variable y with the same iloc[] indexer. Here, as you can see, we select all the rows while keeping only the second-to-last column, Price (the column at index -2).
y = House_df.iloc[:,-2]
y.head()

output:
0    1.059034e+06
1    1.505891e+06
2    1.058988e+06
3    1.260617e+06
4    6.309435e+05
Name: Price, dtype: float64
After this, our next step is to split the entire data set into two parts: a training set on which our model will be trained, and a test set on which we will carry out our predictions. Here, the train and test data are split in an 80:20 ratio.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
len(x_train), len(x_test)

output:
(4000, 1000)
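Note that the call above does not fix random_state, so which rows land in the 4000-row training set changes on every run. A small self-contained sketch (with made-up data) showing the split sizes and how random_state pins the shuffle:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 8 2
```

Running the split twice with the same random_state yields identical train and test rows, which is useful when you want reproducible evaluation numbers.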
After splitting our data, we move to the most important step: training our model. For this, we use the LinearRegression class of the scikit-learn library, which we imported earlier, and call its fit() method to find the best fit on our training data.
lr = LinearRegression()
lr.fit(x_train, y_train)

output:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Now that our model has been trained with the linear regression algorithm, it has found the best-fit line, whose equation we can write as:
Y = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5
Here, in this equation w0 is the intercept while w1, w2, and so on, are the coefficients of the above line.
Hence, you can understand the following lines of code, in which we print the value of the intercept and the coefficients for each of the features.
print("intercept", lr.intercept_)
print("coeff", lr.coef_)

output:
intercept -2635723.8371820096
coeff [2.15573804e+01 1.66052767e+05 1.19682601e+05 1.92727191e+03 1.53161516e+01]
coeff_house = pd.DataFrame(lr.coef_, index=x.columns, columns=["coeff"])
coeff_house
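The equation above is exactly what lr.predict computes: the intercept plus the dot product of the features with the coefficients. A self-contained sketch on synthetic data (since the housing CSV lives on a local path) that verifies this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]  # known w0=3, w1=2, w2=-1

lr = LinearRegression().fit(X, y)

# y_hat = w0 + w1*x1 + w2*x2, written as intercept + X @ coef
manual = lr.intercept_ + X @ lr.coef_
print(np.allclose(manual, lr.predict(X)))  # True
```

Because the synthetic target is exactly linear with no noise, the fitted intercept and coefficients recover the true w0, w1, w2 to numerical precision.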
Now, after training the model, our next task is to make predictions with it. So, in the next few lines of code we predict the prices for our test data and then draw a scatter plot of the actual versus predicted values to see whether our predictions are reasonable. The roughly linear shape of the scatter plot below shows that the predictions track the actual prices well.
pred = lr.predict(x_test)
plt.scatter(y_test, pred)
Now, it's time to evaluate our model. We discussed the evaluation metrics in our previous blog; applying the same metrics, we evaluate our model by computing the various errors.
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score
mean_absolute_error(y_test, pred)  # scikit-learn's convention is (y_true, y_pred)

output:
79377.89910545385
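Besides MAE, the imports above also cover mean squared error; RMSE is just its square root and is in the same units as the price. A small sketch with made-up values (the real calls take y_test and pred):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, explained_variance_score

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

mae = mean_absolute_error(y_true, y_pred)            # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # penalizes large errors more
evs = explained_variance_score(y_true, y_pred)       # fraction of variance explained
print(mae, rmse)  # 10.0 10.0
```

MAE and RMSE coincide here because every error has the same magnitude; with one large outlier error, RMSE would grow much faster than MAE.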
At last, after carrying out all the steps of training our model, making predictions and evaluating them, we look at the score, which tells us the fraction of the variance in price that our model explains. As you can see below, this is about 0.917 (roughly 91.7%), which makes our model a very good fit for both the training and testing data.
lr.score(x_test, y_test)

output:
0.9162733312441765

print("Accuracy", explained_variance_score(y_test, pred))

output:
Accuracy 0.9165265952722778
So, with this blog we complete our implementation of the linear regression algorithm in machine learning. I hope this blog helped you get into the technical details of linear regression and its implementation with the help of numpy, pandas, seaborn, matplotlib and sklearn.