KNN in machine learning stands for K-nearest neighbors. It is one of the simplest supervised machine learning algorithms. It can be used for both classification and regression; in this blog we use it as a classification algorithm.
We have covered various other classification algorithms, such as logistic regression, decision trees, and random forests. The links to those blogs are given below:
Logistic Regression: https://ainewgeneration.com/logistic-regression/
Table of Contents
- What is KNN?
- Why do we need this algorithm?
- The factor K
- Euclidean Distance
- Operation of KNN
- Implementation on rain prediction dataset
What is KNN?
KNN, as introduced above, is a supervised machine learning algorithm used for classification. It solves classification problems by assigning each data point to a class based on the classes of its nearest neighbors. The algorithm involves a few key ideas, such as the factor K and the Euclidean distance, which we will discuss as we proceed.
Why do we need this algorithm?
K-nearest neighbor is a useful algorithm for pattern recognition tasks in which objects are classified based on their attributes.
Suppose we have a dataset that contains information about cats and dogs, and we get a new data point that must be classified as either a cat or a dog. This type of problem can be solved easily using KNN. We will work through a similar problem later in this blog to understand how KNN operates.
The factor K
The factor K is the deciding parameter of the KNN algorithm.
It specifies the number of nearest neighbors that are considered in the majority vote: a new data point is assigned the class that is most common among its K nearest neighbors.
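The majority vote itself can be sketched in a few lines of Python. This is only an illustration: the function name majority_vote and the example labels are made up for this sketch and are not part of any library.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the K nearest neighbors."""
    counts = Counter(neighbor_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical labels of the K = 5 nearest neighbors of a new data point
print(majority_vote(["cat", "dog", "cat", "cat", "dog"]))  # cat
```

With K = 5, three of the five neighbors are cats, so the new point is labeled as a cat.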
There is no fixed way to determine the best value of K. It varies from problem to problem, as well as with the business scenario. A value of 5 is a common starting point, and it is also the default in scikit-learn's KNeighborsClassifier. A very small K, such as one or two, makes the model sensitive to noise and outliers in the training data, and can thus result in overfitting of the model.
Even so, a common rule of thumb for choosing K is to take the square root of n (sqrt(n)), where n is the total number of data points.
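The square-root rule can be written as a small helper. This is a rough sketch: the name suggest_k is hypothetical, and nudging an even result up to the next odd number is an extra assumption added here to reduce ties in a two-class vote. It is not part of the rule itself.

```python
import math

def suggest_k(n_points):
    """Rule-of-thumb K: round(sqrt(n)), bumped to an odd number.

    The odd-number adjustment is an assumption of this sketch, meant to
    reduce tied votes when there are two classes.
    """
    k = max(1, round(math.sqrt(n_points)))
    if k % 2 == 0:
        k += 1
    return k

print(suggest_k(9))    # 3
print(suggest_k(100))  # 11
```

Treat the result only as a starting point. Comparing a few values of K on held-out data is usually a safer way to settle on one.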
Euclidean Distance
The Euclidean distance between two points in the plane with coordinates (x, y) and (a, b) is the length of the straight line segment that joins them. It is the square root of the sum of the squared differences between the corresponding coordinates of the two points.
It can be given by the formula:
d = sqrt((x - a)^2 + (y - b)^2)
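As a quick check of the definition, here is a minimal Python sketch. The function name euclidean_distance is chosen for this example, and it works for points with any number of coordinates, not only two.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# The classic 3-4-5 right triangle
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```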
Operation of KNN
Let us consider a dataset that contains two variables, namely height (cm) and weight (kg). Each point in this dataset is classified as Normal or Underweight.
Suppose now that a new data point (57, 170), that is, a weight of 57 kg and a height of 170 cm, is encountered, and we need to classify it based on this dataset. We will use the KNN algorithm to do this.
To find out how this data point relates to its neighbors, we calculate its Euclidean distance from every point in the dataset.
The table shows the calculated values of Euclidean distance of the new unknown data point from all the points.
We will choose the value of K as 3, as there are 9 data points in our dataset and sqrt(9) = 3.
From the table of Euclidean distances, we pick the three data points with the smallest distances. All three of these nearest neighbors are classified as Normal, so the majority vote assigns the new data point to the Normal class as well.
Thus, the data point (57, 170) is classified as Normal.
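The whole procedure can be sketched from scratch in Python. Note that the nine (weight, height) points below are made-up illustrative values and are not the table from this blog. The function name knn_classify is also just a name chosen for this sketch.

```python
import math
from collections import Counter

# Hypothetical training data: ((weight_kg, height_cm), label)
train = [
    ((51, 167), "Underweight"),
    ((62, 182), "Normal"),
    ((69, 176), "Normal"),
    ((64, 173), "Normal"),
    ((65, 172), "Normal"),
    ((56, 174), "Underweight"),
    ((58, 169), "Normal"),
    ((57, 173), "Normal"),
    ((55, 170), "Normal"),
]

def knn_classify(train, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to the query point
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

k = round(math.sqrt(len(train)))  # sqrt(9) = 3
print(knn_classify(train, (57, 170), k))  # Normal
```

For these made-up values the three closest points are (58, 169), (55, 170) and (57, 173), all labeled Normal, so the vote is unanimous.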
Implementation on rain prediction dataset
First things first. As in every project, our first task is to import all the libraries that are going to be used in our code.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
After importing the necessary libraries, we read our dataset, which is in comma-separated values form (a CSV file), and convert it into a pandas DataFrame named df.
Next, we print the first 5 entries of our data frame with the head() function to better understand its attributes and our target variable.
We then use the describe() function to see summary statistics of our dataset, such as the count, mean, standard deviation, minimum, and maximum of each column.
The info() function tells us the number of non-null values and the data type of each feature.
df.info()
output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24196 entries, 0 to 24195
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Date           24196 non-null  int64
 1   Location       24196 non-null  int64
 2   MinTemp        24196 non-null  float64
 3   MaxTemp        24196 non-null  float64
 4   Rainfall       24196 non-null  float64
 5   Evaporation    24196 non-null  float64
 6   Sunshine       24196 non-null  float64
 7   WindGustDir    24196 non-null  int64
 8   WindGustSpeed  24196 non-null  int64
 9   WindDir9am     24196 non-null  int64
 10  WindDir3pm     24196 non-null  int64
 11  WindSpeed9am   24196 non-null  int64
 12  WindSpeed3pm   24196 non-null  int64
 13  Humidity9am    24196 non-null  int64
 14  Humidity3pm    24196 non-null  int64
 15  Pressure9am    24196 non-null  float64
 16  Pressure3pm    24196 non-null  float64
 17  Cloud9am       24196 non-null  int64
 18  Cloud3pm       24196 non-null  int64
 19  Temp9am        24196 non-null  float64
 20  Temp3pm        24196 non-null  float64
 21  RainToday      24196 non-null  int64
 22  RainTomorrow   24196 non-null  int64
dtypes: float64(9), int64(14)
memory usage: 4.2 MB
Now, we use the value_counts() function to analyze the Cloud9am feature. Its values run from 0 to 8 and represent the fraction of sky that is overcast at 9 am, so it is an ordered numeric variable rather than a set of unordered categories. Analyzing the other features in the same way, we find that every column is already stored in numeric form, and that RainToday and RainTomorrow (our target variable) are already in 0 and 1 form. Thus, we conclude that one-hot encoding is not required here.
df["Cloud9am"].value_counts()
output:
7    7089
1    3683
8    3357
6    2397
3    1665
2    1649
0    1624
5    1581
4    1151
Name: Cloud9am, dtype: int64
Next, we separate our independent variables (the features, x) from our dependent variable (the target, y).
Then, just as we do for every machine learning project, we split the dataset into train and test sets in an 80:20 ratio.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
After this, we check the shape, i.e. the number of rows and columns, of our train and test sets to confirm that the dataset has been split.
x_train.shape, x_test.shape
output:
((19356, 21), (4840, 21))
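Next, we scale our training and test data. StandardScaler standardizes each feature to zero mean and unit variance, using the mean and standard deviation learned from the training set only. This matters for KNN because it is a distance-based algorithm: without scaling, features with larger numeric values would dominate the Euclidean distance and be treated as more important.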
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
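To see what this kind of scaling does to a single feature, here is a small pure-Python sketch of standardization. It is meant to mirror the idea behind StandardScaler (subtract the training mean, divide by the training standard deviation) and is not scikit-learn's actual code. The function name standardize and the numbers are made up for illustration.

```python
import statistics

def standardize(train_col, test_col):
    """Scale both columns with the mean and std of the training column only."""
    mean = statistics.fmean(train_col)
    # Population standard deviation; fall back to 1.0 for a constant column
    std = statistics.pstdev(train_col) or 1.0
    scaled_train = [(v - mean) / std for v in train_col]
    scaled_test = [(v - mean) / std for v in test_col]
    return scaled_train, scaled_test

# Hypothetical feature values, e.g. a wind speed column
scaled_train, scaled_test = standardize([10.0, 20.0, 30.0], [40.0])
print([round(v, 3) for v in scaled_train])  # [-1.225, 0.0, 1.225]
print([round(v, 3) for v in scaled_test])   # [2.449]
```

The scaled values are centered on zero and are not confined to the range from zero to one. The test value is scaled with the training statistics, so it can fall well outside the training range.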
In this dataset, RainTomorrow is our target variable. We now train our model on the training set using the K-nearest neighbor classifier, with K = 5.
from sklearn.neighbors import KNeighborsClassifier
cls = KNeighborsClassifier(n_neighbors=5)
cls.fit(x_train, y_train)
output:
KNeighborsClassifier()
After training our model, our next task is to predict the labels of the test set, as we do in every machine learning problem.
y_pred = cls.predict(x_test)
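After carrying out the prediction, we evaluate our model using the MSE metric, which we already imported from the sklearn.metrics module at the beginning of the implementation. Because the labels are 0 and 1, each squared error is 1 for a wrong prediction and 0 for a correct one, so the MSE here is simply the fraction of misclassified test points.
mean_squared_error(y_test, y_pred)
output:
0.21177685950413222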
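Finally, we compute the accuracy score of the predictions made by our model. It is the fraction of correct predictions, so it equals 1 minus the MSE above: about 78.8% of the test points are classified correctly.
accuracy_score(y_test, y_pred)
output:
0.7882231404958677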
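Notice that the two numbers above add up to 1. For labels that are only 0 or 1, this always holds, as the following pure-Python sketch shows. The helper names accuracy and mse and the toy labels are invented for this illustration and are not the scikit-learn functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted labels."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical 0/1 labels: 2 of the 8 predictions are wrong
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))  # 0.75
print(mse(y_true, y_pred))       # 0.25
```

So for a binary classifier the MSE carries no information beyond the accuracy. Metrics such as precision and recall are usually more informative companions.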
I hope this blog gave you a clear idea of the KNN algorithm. We will cover more such blogs in the future.