22. K-Means Clustering

Introduction to Cluster

Have you ever noticed that customers in a mall don’t behave the same?”
Some people earn high salary but spend less,
some earn less but spend more,
and some are in between.

Can we automatically group similar customers without telling the computer any labels?

That’s exactly what clustering does.
And the most popular clustering algorithm is K-Means.

What is Clustering?

Clustering is a technique used in data mining and machine learning to group similar data points together. In clustering, data points within the same group (called a cluster) are more similar to each other than to those in other groups.
It is an unsupervised learning method, meaning the data does not have predefined labels.

The main goal of clustering is to discover hidden patterns or structures in data by organizing it into meaningful groups.

Example,1 of Clustering

Example: Student Study Hours

Suppose we have data about students based on the number of hours they study per day:

If we apply clustering, the algorithm may group students as follows:

Cluster 1 (Low study hours): A, B, C
Cluster 2 (High study hours): D, E, F

Here:

Students in the same cluster have similar study habits
No labels like “low” or “high” were given in advance
The algorithm identified the groups automatically

This is an example of how clustering helps organize data based on similarity.

Example 2- of Clustering Using Salary and Spending Score

Imagine a Mall Manager Problem

“You are the manager of a shopping mall.”

You have customer data:

Salary
Spending Score

You want to:

Give discounts
Send offers
Identify premium customers

“Should we treat all customers the same?”

No!
Some earn more, some spend more, some are window shoppers.

So we need to GROUP customers automatically.

This grouping without labels is called → Clustering

And the most famous clustering algorithm is K-Means.

Consider a shopping mall that collects data about customers based on two features:

Annual Salary (₹ in lakhs)
Spending Score (a value from 1 to 100 showing how much a customer spends)

When clustering is applied to this dataset:

Cluster 1: Low Salary – Low Spending

Customers: C1, C2, C3
Characteristics:
- Lower income
- Low spending behavior

Cluster 2: High Salary – High Spending

Customers: C4, C5, C6
Characteristics:
- Higher income
- High spending behavior

Explanation

Customers within the same cluster have similar salary and spending patterns
No labels (such as “low spender” or “high spender”) were given initially
The clustering algorithm automatically grouped customers based on similarity
This helps businesses identify different customer segments

Why This Example Is Important

Helps in customer segmentation
Supports targeted marketing
Improves business strategies

K-Means Clustering

K-Means is an unsupervised clustering algorithm used to group similar data points.
It is called a centroid-based algorithm because each cluster is represented by a centroid (mean value).
The main aim of K-Means is to divide a dataset into K clusters, where K is a predefined number.

How K-Means Clustering Works

K-Means clustering is an unsupervised machine learning algorithm that groups similar data points into a predefined number of clusters, denoted by K. The algorithm works through a series of iterative steps to form meaningful clusters based on distance and similarity.

1. Initialization

The process begins by selecting K data points randomly from the dataset. These selected points serve as the initial centroids, which represent the centers of the clusters.

2. Assignment

For each data point in the dataset, the distance to every centroid is calculated using a distance measure such as Euclidean distance. Each data point is then assigned to the cluster whose centroid is the closest. This step results in the formation of K initial clusters.

3. Centroid Update

After all data points are assigned, the centroids are updated by computing the mean of all data points within each cluster. The newly calculated mean becomes the updated centroid for that cluster.

4. Iteration and Convergence

The assignment and centroid update steps are repeated multiple times. With each iteration, the centroids move closer to the center of their respective clusters. The algorithm continues until convergence, which occurs when the centroids stop changing significantly or when a predefined number of iterations is reached.

5. Final Result

Once convergence is achieved, the algorithm outputs the final cluster centroids along with the cluster assignment for each data point. The dataset is now partitioned into well-defined groups based on similarity.

What is a Centroid?

A centroid is the center point of a cluster in clustering algorithms like K-Means.

It represents the average position of all data points belonging to that cluster.

Centroid :

Centroid is the mean (average) of all data points in a cluster and is used to represent the cluster.

Mathematical Definition

If a cluster has n points:

y1,y2,…,yn

Then the centroid is:

Cx=x1+x2++xnn

Cy=y1+y2++ynn

Example

Cluster C1 contains points:

P1 (2,15)
P2 (3,18)
P3 (4,12)
P7 (4,16)
P8 (3,14)

Centroid calculation:

Cx=2+3+4+4+35=3.2

Cy=15+18+12+16+145=15

Centroid = (3.2, 15)

Why Centroid is Important in K-Means?

It represents the cluster center
Used to assign nearest data points
Updated repeatedly until it stops changing
Helps minimize within-cluster distance

Distance Measures in K-Means

Euclidean Distance (Most Common)

Measures the straight-line distance between two points.
Works best for numerical data.
Most widely used distance measure in K-Means.

Formula (for two features):

d=(x1-x2)2+(y1-y2)2

Use case:
When data is continuous and features are on a similar scale.

Manhattan Distance

Measures distance as the sum of absolute differences.
Also called city-block distance.
Movement is only in horizontal and vertical directions.

Formula:

d=∣x1-x2∣+∣y1-y2∣

Use case:
Useful when data has grid-like structure.

22. K-Means Clustering

Introduction to Cluster

What is Clustering?

K-Means Clustering

Distance Measures in K-Means

Comments

More from this blog

26.Text Processing in NLP

24.Introduction of NLP

23.2 From Brain Neurons → to Artificial Neurons

23.1 Neural Networks

22.3 Illustration of K-Means Clustering Using a Sample Dataset

Command Palette

Introduction to Cluster

What is Clustering?

K-Means Clustering

Distance Measures in K-Means

Comments

More from this blog