AI Next Generation
K-Means Clustering in Machine Learning

  • Posted by Subhasish
  • Date September 19, 2021
  • Comments 0 comment

 

Introduction

K-Means Clustering is an unsupervised learning algorithm used for clustering problems in machine learning. It groups an unlabeled dataset into distinct clusters.

Table of Contents

  1. What is K-Means Clustering?
  2. How does K-Means Clustering work?
  3. How to implement K-Means Clustering?

What is K-Means Clustering?

K-means clustering is a clustering method in which data points are assigned to K clusters or groups. It divides the unlabeled dataset into K different clusters in such a way that each data point belongs to exactly one group of points sharing similar characteristics; a point's assignment is based on its distance from each group's centroid. For instance, if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

The data points closest to a given centroid will be clustered under the same category. This algorithm is iterative in nature. 

K-means clustering finds its usage mostly in market segmentation, document clustering, image segmentation, and image compression.

How does K-Means Clustering work?

The k-means clustering algorithm primarily performs the following two tasks:

  • Determining the best value for the K center points, or centroids, through an iterative process
  • Assigning each data point to its closest centroid; the data points nearest a particular centroid together form a cluster

Hence each of these clusters contains data points that share similar characteristics, and the clusters are kept apart from each other.
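The nearest-centroid assignment described above can be sketched in a few lines of NumPy; the points and centroids below are made-up toy values, purely for illustration:

```python
import numpy as np

# Toy 2-D data points and two candidate centroids (illustrative values)
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])

# Distance of every point to every centroid: shape (n_points, k)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point joins the cluster of its nearest centroid
labels = dists.argmin(axis=1)
print(labels)  # the first two points fall in cluster 0, the last two in cluster 1
```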

The below pictorial representation explains the working of the K-means Clustering Algorithm –

How to implement K-Means Clustering?

The points below break down the steps for implementing the algorithm:

  1. Select the number K to determine the number of clusters.
  2. Choose K random points as the initial centroids.
  3. Assign each data point to its nearest centroid, which forms the predefined K clusters.
  4. Compute the new centroid (the mean) of each cluster.
  5. Reassign each data point to the new nearest centroid of each cluster.
  6. If any assignment changed, go back to step 4; otherwise, finish.
  7. The model is complete.
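The steps above can be sketched as a minimal from-scratch implementation in NumPy; the function name and toy data are ours, purely for illustration:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: choose K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3/5: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs; we expect two clean clusters
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
              [8, 8], [8.2, 7.9], [7.8, 8.1]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)
```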

To put the above steps into practice, we will use the Weather dataset and carry out clustering by implementing K-Means. You can find the dataset below for your reference.

Weather Data : https://drive.google.com/file/d/1soldB4xBmH3j3WBLsYaR_vvVpZFnpfDR/view?usp=sharing

Let's start by importing the libraries and then doing a bit of data processing.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Having imported the necessary libraries, we need to read and load our dataset, which is in comma-separated values (.csv) form, and convert it into a pandas data frame named df.

df=pd.read_csv(r"C:\Learning\python_class\minute_weather.csv")

Let's inspect our data frame by printing its first five rows:

df.head()

Now, we will use the info() function to check the data types and null values in our dataset.

df.info()

In the output above, a full non-null count for a column implies that the column has no null values.

Let's check for missing values, and drop them if their count is low.

df.isnull().sum()

Since only a few values are missing, we can drop those rows:

df=df.dropna()
df.isnull().sum()

Looking at the data, we can see that the first two columns can be dropped, as they don't contribute any insights or inferences:

df= df.iloc[:,2:]
df

Having fixed the range of columns on which to build our model, we move on to scaling. The reasons for scaling in machine learning have been discussed previously; you can revisit that post if it is still unclear.

sc= StandardScaler()
df_scale=sc.fit_transform(df)
df_scale
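As a quick sanity check on what StandardScaler is doing here: each column of its output has (approximately) zero mean and unit standard deviation. A minimal sketch on a made-up toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two columns on very different scales (illustrative values)
toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # each column's mean is ~0
print(scaled.std(axis=0))   # each column's std is ~1
```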

With all the pre-processing done in the steps above, from here on we will create the K-Means model.

from sklearn.cluster import KMeans
model = KMeans(n_clusters=7)
model = model.fit(df_scale)

In the step above we passed 7 as the number of clusters, but that is not a hard and fast rule; you can choose any feasible number of clusters you prefer.

model.cluster_centers_
model.labels_
df["cluster"]=model.labels_
df.head()
df["cluster"].value_counts()

In the above steps we clustered the weather data into different clusters based on their attributes/features.
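Once fitted, a KMeans model can also assign clusters to new, unseen rows via its predict method (remember to pass them through the same fitted scaler first). A self-contained toy sketch, with made-up values:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy training data: two obvious groups (illustrative values only)
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# New, unseen points are assigned to the nearest learned centroid
new_rows = np.array([[0.05, 0.1], [5.1, 5.0]])
preds = km.predict(new_rows)
print(preds)  # the two new points land in different clusters
```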

Now, we will calculate the inertia in the step below. Inertia measures how well a dataset was clustered by K-Means.

model.inertia_

The lower the inertia value, the better the clustering (for a fixed number of clusters).
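Inertia is simply the sum of squared distances from each point to its assigned cluster centroid, so we can verify model.inertia_ by computing that sum by hand. A toy sketch (the values are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight pairs of points; with k=2 each pair becomes one cluster
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sum of squared distances of each point to its cluster centroid
manual = sum(np.sum((X[km.labels_ == j] - c) ** 2)
             for j, c in enumerate(km.cluster_centers_))
print(manual, km.inertia_)  # the two values should match
```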

Now, we will finally plot the Elbow method curve. The Elbow method is one of the most popular ways to select the optimal number of clusters: we fit the model over a range of values of K and compare the resulting inertias.

import matplotlib.pyplot as plt

df_elbow = []
for i in range(1, 15):
    model = KMeans(n_clusters=i)
    model.fit(df_scale)
    df_elbow.append(model.inertia_)
plt.plot(range(1, 15), df_elbow)
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.show()

The elbow point is where the SSE, or inertia, starts decreasing in a roughly linear manner. In the figure above, you may note that this happens at around 7 clusters.
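Picking the elbow by eye can also be automated. One simple heuristic (our own illustration, not a scikit-learn feature) is to choose the K where the decrease in inertia slows most sharply, i.e. the largest second difference of the inertia curve. The inertia values below are hypothetical:

```python
import numpy as np

# Hypothetical inertia values for K = 1..8, shaped like a typical elbow curve
inertias = [1000.0, 800.0, 600.0, 300.0, 280.0, 260.0, 250.0, 245.0]

# Second difference: how much the rate of improvement changes at each K
second_diff = np.diff(inertias, n=2)

# +2 because differencing twice shifts the index, and K starts at 1
elbow_k = int(np.argmax(second_diff)) + 2
print(elbow_k)  # 4: the curve flattens most sharply after K = 4
```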

Hopefully, we have been able to shed some light on how K-means works and how to implement it in Python. Going ahead we will also discuss how to combine PCA and K-means clustering.


Copyright © 2021 AI New Generation

Become an instructor?

Join thousand of instructors and earn money hassle free!

Get started now

Login with your site account

Lost your password?

Not a member yet? Register now

Register a new account

Are you a member? Login now