K-Means Clustering is an unsupervised learning algorithm used for clustering problems in machine learning. It groups an unlabeled dataset into distinct clusters.
Table of Contents
- What is K-Means Clustering?
- How does K-Means Clustering work?
- How to implement K-Means Clustering?
What is K-Means Clustering
K-means clustering is a clustering method that assigns data points to K clusters, or groups. It divides an unlabeled dataset into K clusters in such a way that each data point belongs to exactly one group whose members share similar characteristics. K is the number of clusters and is chosen in advance; each point is assigned based on its distance from each group’s centroid. For instance, if K=2, there will be two clusters; if K=3, there will be three clusters, and so on.
The data points closest to a given centroid will be clustered under the same category. This algorithm is iterative in nature.
K-means clustering is mostly used for market segmentation, document clustering, image segmentation, and image compression.
How does K-Means Clustering work
The K-means clustering algorithm primarily performs the following two tasks:
- Determines the best positions for the K center points, or centroids, through an iterative process.
- Assigns each data point to its closest K-center. The data points nearest to a particular K-center together form a cluster.
Hence, each cluster contains data points that share similar characteristics, and the clusters are kept apart from one another.
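To make the assignment task concrete, here is a minimal sketch using NumPy. The toy points and the two centroids are invented purely for illustration; they are not taken from the weather data used later.

```python
import numpy as np

# Toy 2-D data: two loose groups (values invented for illustration).
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

# Two hypothetical centroids, i.e. K = 2.
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

# Euclidean distance from every point to every centroid: shape (n_points, K).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point joins the cluster of its nearest centroid.
labels = distances.argmin(axis=1)
print(labels)  # [0 0 0 1 1 1]
```

The first three points land in cluster 0 and the last three in cluster 1, because each is closer to that centroid than to the other one.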
The pictorial representation below illustrates how the K-means clustering algorithm works:
How to implement K-Means Clustering
The points below break down the steps for implementing the algorithm:
- Select the number K to decide how many clusters to form.
- Choose K random points as the initial centroids.
- Assign each data point to its nearest centroid, which forms the K clusters.
- Compute the mean of the points in each cluster and place a new centroid there.
- Reassign each data point to its new nearest centroid.
- If any assignment changed, go back to step 4; otherwise, finish.
- The model is complete.
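The steps above can be sketched from scratch with NumPy. This is only an illustrative sketch under a few assumptions: the function name `kmeans`, its parameters, and the toy data are hypothetical, the initial centroids are drawn from the data points themselves, and an empty cluster simply keeps its previous centroid. It is not how scikit-learn implements the algorithm internally.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch; returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        new_labels = distances.argmin(axis=1)
        # Step 6: stop once no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data with two well-separated groups (values invented for illustration).
data = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                 [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
centroids, labels = kmeans(data, k=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 or cluster 1 depends on the random initial centroids, so only the grouping itself is meaningful, not the label numbers.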
To illustrate the above steps, we will use the Weather data set and cluster it with K-Means. You can find the data set below for your reference.
Let’s begin by importing the libraries and then doing a little data processing.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Once the necessary libraries are imported, we need to read and load our dataset, which is in comma-separated values form (a .csv file), and convert it into a pandas data frame named df.
We will check our data frame by printing its first five entries. The code and image are the following.
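As a rough sketch of the loading and preview steps, the snippet below reads CSV text with `pd.read_csv` and shows the first five rows with `head()`. The column names and values here are made up so that the example runs on its own; with the real file, you would pass its path to `pd.read_csv` instead (the file name is not given here, so it is left out).

```python
import io
import pandas as pd

# Stand-in for the weather CSV; column names and values are invented.
csv_text = """Date,Location,MinTemp,MaxTemp,Humidity
2020-01-01,StationA,13.4,22.9,71
2020-01-02,StationA,7.4,25.1,44
2020-01-03,StationA,12.9,25.7,38
2020-01-04,StationB,9.2,28.0,45
2020-01-05,StationB,17.5,32.3,82
2020-01-06,StationB,14.6,29.7,55
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
preview = df.head()
print(preview)
```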
Now, we will use the info() function to check the data types and null values in our data set.
In the above image, the Non-Null Count column shows how many non-null entries each column has. When that count equals the total number of rows, the column has no null values.
Let’s check for missing values and drop them if there are only a few.
Since the number of missing values is low, we can drop them.
After inspecting the data, we can see that the first two columns can be dropped, as they don’t contribute any insights or inferences.
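Here is a small sketch of what info() reports, using a made-up data frame with one missing value (the column names and numbers are assumptions for illustration only). `df.count()` gives the same non-null counts as plain numbers.

```python
import numpy as np
import pandas as pd

# Invented data: the Humidity column has one missing entry.
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "Humidity": [71.0, np.nan, 38.0, 45.0],
})

# Prints the dtype and the non-null count of every column.
df.info()

# The same non-null counts, as a Series that can be inspected in code.
non_null_counts = df.count()
print(non_null_counts)
```

Here MinTemp has 4 non-null entries out of 4 rows, so it has no nulls, while Humidity has only 3.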
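A minimal sketch of counting and dropping missing values could look like this. The data frame is again invented for illustration; `isnull().sum()` counts the missing entries per column and `dropna()` removes every row that contains at least one.

```python
import numpy as np
import pandas as pd

# Invented data with two incomplete rows.
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2, 17.5],
    "Humidity": [71.0, 44.0, 38.0, np.nan, 82.0],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
df = df.dropna()
print(len(df))  # 3
```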
df = df.iloc[:, 2:]
df
After fixing the range of columns we need for our model, we move on to scaling. The reason for scaling in machine learning has been discussed previously; one can look into that if it’s still not clear.
sc = StandardScaler()
df_scale = sc.fit_transform(df)
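To see what the scaler does, the sketch below compares StandardScaler with the manual formula (value minus column mean, divided by column standard deviation). The small array is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two invented features on very different scales.
X = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand: subtract the column mean, divide by the
# column standard deviation (population std, which is NumPy's default).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True
print(scaled.mean(axis=0))          # each column is centred near 0
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the distance calculation in K-Means.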
We are done with all the pre-processing in the above steps, so from here on we will create the K-Means model.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=7)
model = model.fit(df_scale)
In the above step, we passed 7 to KMeans, but that is not a hard and fast rule; one can choose any feasible number of clusters.
In the above steps we clustered the weather data into different clusters based on their attributes/features.
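Once a model is fitted, the cluster of each row and the centroid positions can be read from its `labels_` and `cluster_centers_` attributes. The sketch below uses synthetic data with two obvious groups rather than the weather data; `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 20 points around (0, 0) and 20 points around (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(model.labels_[:5])          # cluster index of the first five rows
print(model.cluster_centers_)     # one centroid per cluster
print(np.bincount(model.labels_)) # number of points in each cluster
```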
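Now, we will calculate the inertia in the step below. Inertia measures how well a dataset was clustered by K-Means: it is the sum of squared distances between each data point and the centroid of its cluster.
For a given number of clusters, the lower the inertia value, the better.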
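As a sketch of what the inertia value represents, the snippet below reads `inertia_` from a fitted model and recomputes it by hand as the sum of squared distances from each point to its assigned centroid. The data is synthetic and only meant for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.inertia_)

# Manual check: look up the centroid assigned to each point, then sum
# the squared distances between the points and those centroids.
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((X - assigned_centroids) ** 2).sum()
print(np.isclose(manual_inertia, model.inertia_, rtol=1e-6))
```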
Now, we will finally plot the Elbow method. It is one of the most popular methods for selecting the optimal number of clusters: the model is fitted for a range of values of K, and the inertia is plotted against K.
import matplotlib.pyplot as plt

df_elbow = []  # inertia for each value of K
for i in range(1, 15):
    model = KMeans(n_clusters=i)
    model.fit(df_scale)
    df_elbow.append(model.inertia_)
plt.plot(range(1, 15), df_elbow)
The elbow point is the point after which the SSE, or inertia, starts decreasing in a roughly linear manner. In the above figure, you may note that the SSE starts decreasing linearly at 7 clusters.
Hopefully, we have been able to shed some light on how K-means works and how to implement it in Python. Going ahead we will also discuss how to combine PCA and K-means clustering.