AI Next Generation

KNN in Machine Learning

  • Posted by Natasha
  • Date September 9, 2021

Introduction

KNN stands for K-nearest neighbors. It is one of the simplest supervised machine learning algorithms and is most often used for classification, although it can also be applied to regression.

We have already covered other classification algorithms such as logistic regression, decision tree, and random forest. Links to those posts are given below:

Logistic Regression: https://ainewgeneration.com/logistic-regression/

Decision Tree: https://ainewgeneration.com/decision-tree-in-machine-learning/

Random Forest: https://ainewgeneration.com/random-forest-in-machine-learning/

Table of Contents

  1. What is KNN?
  2. Why do we need this algorithm?
  3. The factor K
  4. Euclidean Distance
  5. Operation of KNN
  6. Implementation on rain prediction dataset

What is KNN?

As introduced above, KNN is a supervised machine learning algorithm used for classification. It assigns a data point to a class based on the classes of its nearest neighbors. Two ingredients matter here, the factor K and the Euclidean distance, and we will discuss both as we proceed.

Why do we need this algorithm?

K-nearest neighbor is a useful algorithm for pattern recognition tasks in which objects are classified based on their attributes.

Suppose we have a data set that contains information about cats and dogs, and we get a new data point that must be identified as either a cat or a dog. This type of problem can be solved easily using KNN. We will work through a similar example later in this post to see how KNN operates.

The factor K

The factor K is the deciding parameter of the KNN algorithm.

It signifies the number of nearest neighbors that are considered during majority voting.

There is no fixed rule for determining the best value of K. It varies from problem to problem and with the business scenario. A value of 5 is a common starting point. A K of one or two makes the prediction sensitive to noise and outliers in the training data, which can result in overfitting of the model.

A commonly used rule of thumb is to take K close to the square root of n (sqrt(n)), where n is the total number of data points. For two-class problems, an odd K is often preferred so that the vote cannot end in a tie.
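As a rough illustration of the square-root rule of thumb, the helper below picks an odd K near sqrt(n). The function name rule_of_thumb_k and the choice to round up to the next odd number are assumptions made for this sketch, not a standard API, and the result is only a starting point for tuning.

```python
import math


def rule_of_thumb_k(n_samples: int) -> int:
    """Return an odd k close to sqrt(n_samples), never smaller than 1."""
    k = max(1, round(math.sqrt(n_samples)))
    # An odd k avoids tied votes when there are only two classes.
    if k % 2 == 0:
        k += 1
    return k


print(rule_of_thumb_k(9))      # 3
print(rule_of_thumb_k(24196))  # 157 (sqrt is about 155.6, rounded to 156, bumped to odd)
```

For large datasets this rule can suggest a much larger K than the usual starting value of 5, so it is worth validating the choice on held-out data.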

Euclidean Distance

The Euclidean distance between two points in the plane with coordinates (x, y) and (a, b) is the length of the straight line segment joining them. It is the square root of the sum of the squared differences between the coordinates of the two points.

It can be given by the formula:

d = sqrt((x - a)^2 + (y - b)^2)
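The same calculation can be sketched in a few lines of Python. The function name euclidean_distance is chosen here for illustration, and the version below works for points with any number of coordinates, not only two.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square each coordinate difference, add them up, take the square root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

On Python 3.8 and later, the standard library function math.dist(p, q) computes the same value.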

Operation of KNN

Let us consider a dataset that contains two variables, namely weight (kg) and height (cm). Each point in this dataset is classified as Normal or Underweight.

Suppose a new data point (57, 170), that is, a weight of 57 kg and a height of 170 cm, is encountered, and we need to classify it based on this data set. We will use the KNN algorithm to do this.

To relate this data point to its neighbors, we calculate its Euclidean distance from every point in the data set.

The table shows the calculated values of Euclidean distance of the new unknown data point from all the points.

We choose k = 3, since there are 9 data points in the given data set and sqrt(9) = 3.

From the table of Euclidean distances, we pick the three data points with the smallest distances. All three of these nearest neighbors are classified as Normal, so the majority vote classifies the new data point as Normal as well.

Thus, the data point (57, 170) is classified as Normal.
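The steps above (compute distances, keep the k closest points, take a majority vote) can be sketched without any library. The nine (weight, height) points below are illustrative values assumed for this sketch, not necessarily the ones in the post's table, and knn_predict is a hypothetical helper name.

```python
import math
from collections import Counter

# Illustrative (weight kg, height cm) points with labels; assumed values.
training_data = [
    ((51, 167), "Underweight"),
    ((62, 182), "Normal"),
    ((69, 176), "Normal"),
    ((64, 173), "Normal"),
    ((65, 172), "Normal"),
    ((56, 174), "Underweight"),
    ((58, 169), "Normal"),
    ((57, 173), "Normal"),
    ((55, 170), "Normal"),
]


def knn_predict(data, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query, keep the k closest.
    nearest = sorted(data, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict(training_data, (57, 170), k=3))  # Normal
```

With these values the three nearest neighbors of (57, 170) are (58, 169), (55, 170) and (57, 173), all labeled Normal, so the vote is unanimous.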

Implementation on rain prediction dataset

First things first. As in every project, our first task is to import all the libraries that are going to be used in our code.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score,mean_squared_error

After importing the necessary libraries, we read our dataset, which is a comma-separated values (CSV) file, into a pandas DataFrame named df.

df= pd.read_csv(r"C:\Users\Projects\Rain_prediction_Dataset.csv")

Next we will print the first 5 entries of our data frame to understand its attributes and our target variable in a better way.

df.head()

Next, we use the describe() function to see summary statistics of our dataset.

df.describe()

The info() function tells us the number of non-null values and the data type of each feature.

df.info()

output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24196 entries, 0 to 24195
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           24196 non-null  int64  
 1   Location       24196 non-null  int64  
 2   MinTemp        24196 non-null  float64
 3   MaxTemp        24196 non-null  float64
 4   Rainfall       24196 non-null  float64
 5   Evaporation    24196 non-null  float64
 6   Sunshine       24196 non-null  float64
 7   WindGustDir    24196 non-null  int64  
 8   WindGustSpeed  24196 non-null  int64  
 9   WindDir9am     24196 non-null  int64  
 10  WindDir3pm     24196 non-null  int64  
 11  WindSpeed9am   24196 non-null  int64  
 12  WindSpeed3pm   24196 non-null  int64  
 13  Humidity9am    24196 non-null  int64  
 14  Humidity3pm    24196 non-null  int64  
 15  Pressure9am    24196 non-null  float64
 16  Pressure3pm    24196 non-null  float64
 17  Cloud9am       24196 non-null  int64  
 18  Cloud3pm       24196 non-null  int64  
 19  Temp9am        24196 non-null  float64
 20  Temp3pm        24196 non-null  float64
 21  RainToday      24196 non-null  int64  
 22  RainTomorrow   24196 non-null  int64  
dtypes: float64(9), int64(14)
memory usage: 4.2 MB

Now we use the value_counts() function to analyze the Cloud9am feature. It takes integer values from 0 to 8 that represent how much of the sky is overcast at 9 am, so it is an ordered numeric measure rather than a set of unordered categories. Analyzing the other features in the same way, we treat all of them as numeric: df.info() shows every column stored as int64 or float64, and RainToday and RainTomorrow (our target variable) are already in 0 and 1 form. Thus, we conclude that one hot encoding is not required here.

df["Cloud9am"].value_counts()

output:
7    7089
1    3683
8    3357
6    2397
3    1665
2    1649
0    1624
5    1581
4    1151
Name: Cloud9am, dtype: int64

Next, we separate the independent variables (x) from the dependent variable (y). Every column except the last goes into x, and the last column, RainTomorrow, becomes y.

x=df.iloc[:,:-1].values
y=df.iloc[:,22]

Then, as in every machine learning project, we split the dataset into train and test sets in an 80:20 ratio.

x_train, x_test, y_train, y_test=train_test_split(x,y, test_size=0.2)
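The split above is drawn at random on every run, so the scores reported later will vary slightly between runs. As a hedged sketch on synthetic stand-in data (not the rain dataset), passing random_state makes the split repeatable and stratify keeps the share of rainy days the same in both sets. The names X_demo and y_demo and the seed values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 22 features, roughly 22% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 22))
y_demo = (rng.random(1000) < 0.22).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,      # 80:20 split
    random_state=42,    # fixed seed so the split is reproducible
    stratify=y_demo,    # preserve the class proportions in both sets
)

print(X_tr.shape, X_te.shape)    # (800, 22) (200, 22)
print(y_tr.mean(), y_te.mean())  # positive-class shares stay close to each other
```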

After this, we check the shape, i.e. the number of rows and columns, of our train and test sets to confirm that the dataset has been split.

x_train.shape, x_test.shape

output:
((19356, 22), (4840, 22))

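Next, we standardize our training and test data. StandardScaler rescales each feature to zero mean and unit variance, using statistics learned from the training set only. This matters for KNN because it is distance-based: without scaling, features with larger numeric ranges would dominate the distance and be treated as more important.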

scaler=StandardScaler()
scaler.fit(x_train)
x_train=scaler.transform(x_train)
x_test=scaler.transform(x_test)
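To make the effect of scaling concrete, here is a small sketch on a made-up two-column matrix (the humidity and pressure values are invented for illustration). It shows that StandardScaler produces columns with mean 0 and standard deviation 1, whereas MinMaxScaler is the scaler that maps each column into the range zero to one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features on very different scales: humidity (%) and pressure (hPa).
X_small = np.array(
    [[20.0, 1005.0], [50.0, 1010.0], [80.0, 1015.0], [90.0, 1030.0]]
)

# Standardization: each column ends up with mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X_small)
print(standardized.mean(axis=0))  # both values are (numerically) zero
print(standardized.std(axis=0))   # both values are one

# Min-max scaling: each column is mapped into the range [0, 1].
rescaled = MinMaxScaler().fit_transform(X_small)
print(rescaled.min(axis=0), rescaled.max(axis=0))
```

Either scaler puts the features on a comparable footing for a distance-based model. Which one works better is an empirical question for the dataset at hand.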

In this dataset, RainTomorrow is our target variable. We now train a K-nearest neighbor classifier with K = 5 on the training set, as shown in the code snippet below.

from sklearn.neighbors import KNeighborsClassifier 
cls=KNeighborsClassifier(n_neighbors=5)
cls.fit(x_train, y_train)

output:
KNeighborsClassifier()
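The value n_neighbors=5 is a reasonable default, but K can also be chosen by comparing cross-validated accuracy for several candidates. The sketch below does this on synthetic data from make_classification, not the rain dataset. The candidate list and the names X_cv, y_cv and cv_scores are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for a real training set.
X_cv, y_cv = make_classification(n_samples=500, n_features=10, random_state=0)

cv_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_cv, y_cv, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```

Putting the scaler in the pipeline keeps information from the validation folds out of the scaling statistics, which mirrors fitting the scaler on x_train only.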

After training our model, our next task is to predict the labels of the test set.

y_pred=cls.predict(x_test)

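After carrying out the prediction, we evaluate the error of our model using the mean squared error (MSE) metric, which we imported from the sklearn.metrics module at the beginning of the implementation. Because both the true and predicted labels are 0 or 1, the MSE here is simply the fraction of test rows that were misclassified.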

mean_squared_error(y_test, y_pred)

output:
0.21177685950413222

Finally, we compute the accuracy score of the predictions made by our model, i.e. the fraction of test rows classified correctly.

accuracy_score(y_test, y_pred)

output:
0.7882231404958677
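For 0/1 labels the MSE and the accuracy always add up to one, which is why the two outputs above sum to 1. A confusion matrix adds detail by showing which kind of error the model makes. The sketch below uses ten made-up labels purely for illustration. The names y_true and y_hat are assumptions, not variables from the code above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Made-up 0/1 labels: 3 of the 10 predictions are wrong.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 0, 0])

acc = accuracy_score(y_true, y_hat)
mse = mean_squared_error(y_true, y_hat)
print(acc, mse)  # 0.7 0.3

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)  # [[5 1]
           #  [2 2]]
```

In a rain-prediction setting the bottom-left cell (rainy days predicted as dry) is often the costlier mistake, so it is worth inspecting alongside the overall accuracy.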

End Notes

I hope this blog gave you a clear idea of the KNN algorithm. We will cover more such topics in future posts.


Copyright © 2021 AI New Generation
