Introduction to Hierarchical Clustering

🔹 Definition

Hierarchical Clustering is an unsupervised machine learning technique used to group similar data points into clusters by building a tree-like structure called a dendrogram.

🔹 Key Idea

Instead of fixing the number of clusters in advance, hierarchical clustering:

  • Creates clusters step-by-step
  • Shows how clusters merge or split
  • Provides a visual representation of cluster relationships

🔹 Important Characteristics

  • Does not require pre-defining K (number of clusters)
  • Based on distance (similarity) measures
  • Produces interpretable results
  • Suitable for small to medium datasets

2. Purpose of Hierarchical Clustering

Hierarchical clustering is used to:

  • Discover Natural Groupings – identify hidden patterns in data without labels.
  • Understand Cluster Relationships – show how data points are related at different levels of the hierarchy.
  • Exploratory Data Analysis – support research and academic analysis.
  • Determine the Optimal Number of Clusters – the dendrogram helps visually decide the best cluster count.
  • Support Decision Making – useful in:

  • University performance segmentation
  • Healthcare patient grouping
  • Market segmentation
  • Institutional comparison

3. Techniques Used in Hierarchical Clustering

Hierarchical clustering has two main techniques:


🔹 1. Agglomerative Hierarchical Clustering (Bottom-Up Approach)

  • Start with each data point as its own cluster
  • Merge closest clusters step-by-step
  • Continue until all points form one cluster

Most commonly used method.
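The bottom-up merging described above can be sketched with SciPy's `linkage` function; the toy 2-D points below are illustrative values, not data from this post.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data: two well-separated groups (illustrative values only)
X = np.array([[1, 2], [2, 1], [1, 1],
              [8, 8], [9, 8], [8, 9]])

# Agglomerative (bottom-up) clustering with Ward linkage:
# each point starts as its own cluster and the closest pair is merged repeatedly
Z = linkage(X, method="ward")

# Each row of Z records one merge: [cluster_i, cluster_j, merge_distance, new_cluster_size]
print(Z)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree-like structure described earlier.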


🔹 2. Divisive Hierarchical Clustering (Top-Down Approach)

  • Start with all points in one cluster
  • Split clusters recursively
  • Continue splitting until each point becomes separate

Less frequently used.


4. Distance Measures Used

Distance determines similarity between data points.

Common Distance Metrics:

  • Euclidean Distance
  • Manhattan Distance
  • Cosine Distance (1 − cosine similarity)
  • Correlation Distance

5. Linkage Methods (Cluster Merging Criteria)

Linkage determines how distance between clusters is calculated.

🔹 Single Linkage

Distance between two clusters is the minimum distance between any pair of points, one from each cluster.

🔹 Complete Linkage

Distance is the maximum pairwise distance between points in the two clusters.

🔹 Average Linkage

Distance is the average of all pairwise distances between points in the two clusters.

🔹 Ward’s Method

Merges the pair of clusters whose merger gives the smallest increase in within-cluster variance.
It is the most preferred method in research and academic analysis.
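The four criteria give different merge heights on the same data; a small SciPy comparison (toy values, for illustration only):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two pairs of nearby points (illustrative values only)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# Height of the final merge under each linkage criterion
heights = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    heights[method] = Z[-1, 2]  # distance at which the last two clusters merge
    print(method, round(heights[method], 3))
```

Single linkage reports the smallest final height and complete linkage the largest, since they use the closest and farthest point pairs respectively.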


Hierarchical Clustering Diagram – University Dataset Example

(Dendrogram built from features such as SAT, Top10, Expenses, and GradRate.)
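The dendrogram image itself is not reproduced here, but a sketch of how such a tree could be built from the four features named above (SAT, Top10, Expenses, GradRate); the rows are made-up values for illustration, not the actual university dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.stats import zscore

# Hypothetical universities with columns SAT, Top10, Expenses, GradRate;
# values invented for illustration only
X = np.array([
    [1310.0,  89.0, 22704.0, 94.0],
    [1415.0, 100.0, 63575.0, 81.0],
    [1260.0,  62.0, 25026.0, 72.0],
    [1061.0,  25.0, 11321.0, 80.0],
])

# The features sit on very different scales, so standardize each column first
Xz = zscore(X, axis=0)

Z = linkage(Xz, method="ward")
print(Z)  # pass Z to scipy.cluster.hierarchy.dendrogram to draw the tree
```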

 


How to Identify Clusters in a Dendrogram

1. Look at the vertical lines (height represents distance).
2. Find the biggest vertical gap (the largest jump in height).
3. Draw a horizontal line across that gap.
4. Count how many vertical branches the line cuts.

That number is the number of clusters.



In the dendrogram above, drawing the horizontal line at the largest height gap cuts 7 branches, so the total number of clusters = 7.
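Cutting the tree programmatically mirrors the manual rule just described; with SciPy, `fcluster` returns the flat clusters for a chosen count (toy three-group data for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three well-separated pairs of points (illustrative values only)
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])

Z = linkage(X, method="ward")

# Equivalent to drawing a horizontal line that cuts the dendrogram into 3 branches
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```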

Download University Dataset 

Download Hierarchical Clustering Model



 
