22. K-Means Clustering

Introduction to Cluster
Have you ever noticed that customers in a mall don’t behave the same?”
Some people earn high salary but spend less,
some earn less but spend more,
and some are in between.
Can we automatically group similar customers without telling the computer any labels?
That’s exactly what clustering does.
And the most popular clustering algorithm is K-Means.
What is Clustering?
Clustering is a technique used in data mining and machine learning to group similar data points together. In clustering, data points within the same group (called a cluster) are more similar to each other than to those in other groups.
It is an unsupervised learning method, meaning the data does not have predefined labels.
The main goal of clustering is to discover hidden patterns or structures in data by organizing it into meaningful groups.
Example,1 of Clustering
Example: Student Study Hours
Suppose we have data about students based on the number of hours they study per day:
If we apply clustering, the algorithm may group students as follows:
Cluster 1 (Low study hours): A, B, C
Cluster 2 (High study hours): D, E, F
Here:
Students in the same cluster have similar study habits
No labels like “low” or “high” were given in advance
The algorithm identified the groups automatically
This is an example of how clustering helps organize data based on similarity.
Example 2- of Clustering Using Salary and Spending Score
Imagine a Mall Manager Problem
“You are the manager of a shopping mall.”
You have customer data:
Salary
Spending Score
You want to:
Give discounts
Send offers
Identify premium customers
“Should we treat all customers the same?”
No!
Some earn more, some spend more, some are window shoppers.
So we need to GROUP customers automatically.
This grouping without labels is called → Clustering
And the most famous clustering algorithm is K-Means.
Consider a shopping mall that collects data about customers based on two features:
Annual Salary (₹ in lakhs)
Spending Score (a value from 1 to 100 showing how much a customer spends)
When clustering is applied to this dataset:
Cluster 1: Low Salary – Low Spending
Customers: C1, C2, C3
Characteristics:
Lower income
Low spending behavior
Cluster 2: High Salary – High Spending
Customers: C4, C5, C6
Characteristics:
Higher income
High spending behavior
Explanation
Customers within the same cluster have similar salary and spending patterns
No labels (such as “low spender” or “high spender”) were given initially
The clustering algorithm automatically grouped customers based on similarity
This helps businesses identify different customer segments
Why This Example Is Important
Helps in customer segmentation
Supports targeted marketing
Improves business strategies
K-Means Clustering
K-Means is an unsupervised clustering algorithm used to group similar data points.
It is called a centroid-based algorithm because each cluster is represented by a centroid (mean value).
The main aim of K-Means is to divide a dataset into K clusters, where K is a predefined number.
How K-Means Clustering Works
K-Means clustering is an unsupervised machine learning algorithm that groups similar data points into a predefined number of clusters, denoted by K. The algorithm works through a series of iterative steps to form meaningful clusters based on distance and similarity.
1. Initialization
The process begins by selecting K data points randomly from the dataset. These selected points serve as the initial centroids, which represent the centers of the clusters.
2. Assignment
For each data point in the dataset, the distance to every centroid is calculated using a distance measure such as Euclidean distance. Each data point is then assigned to the cluster whose centroid is the closest. This step results in the formation of K initial clusters.
3. Centroid Update
After all data points are assigned, the centroids are updated by computing the mean of all data points within each cluster. The newly calculated mean becomes the updated centroid for that cluster.
4. Iteration and Convergence
The assignment and centroid update steps are repeated multiple times. With each iteration, the centroids move closer to the center of their respective clusters. The algorithm continues until convergence, which occurs when the centroids stop changing significantly or when a predefined number of iterations is reached.
5. Final Result
Once convergence is achieved, the algorithm outputs the final cluster centroids along with the cluster assignment for each data point. The dataset is now partitioned into well-defined groups based on similarity.
What is a Centroid?
A centroid is the center point of a cluster in clustering algorithms like K-Means.
It represents the average position of all data points belonging to that cluster.
Centroid :
Centroid is the mean (average) of all data points in a cluster and is used to represent the cluster.
Mathematical Definition
If a cluster has n points:
y1,y2,…,yn
Then the centroid is:
Cx=x1+x2++xnn
Cy=y1+y2++ynn
Example
Cluster C1 contains points:
P1 (2,15)
P2 (3,18)
P3 (4,12)
P7 (4,16)
P8 (3,14)
Centroid calculation:
Cx=2+3+4+4+35=3.2
Cy=15+18+12+16+145=15
Centroid = (3.2, 15)
Why Centroid is Important in K-Means?
It represents the cluster center
Used to assign nearest data points
Updated repeatedly until it stops changing
Helps minimize within-cluster distance
Distance Measures in K-Means
Euclidean Distance (Most Common)
Measures the straight-line distance between two points.
Works best for numerical data.
Most widely used distance measure in K-Means.
Formula (for two features):
d=(x1-x2)2+(y1-y2)2
Use case:
When data is continuous and features are on a similar scale.
Manhattan Distance
Measures distance as the sum of absolute differences.
Also called city-block distance.
Movement is only in horizontal and vertical directions.
Formula:
d=∣x1-x2∣+∣y1-y2∣
Use case:
Useful when data has grid-like structure.




