DBSCAN Algorithm of Unsupervised Machine Learning
1. Introduction to DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It is an unsupervised machine learning clustering algorithm that groups data points based on density (i.e., how closely packed the points are within a specific region) rather than distance from a centroid (as in K-Means).
DBSCAN is especially powerful for:
✔ Detecting clusters of arbitrary shapes (clusters that are not necessarily circular or uniform in structure)
✔ Identifying outliers (noise)
✔ Working with real-world noisy data (data containing errors, random variations, irrelevant values, or extreme outliers that do not belong to any meaningful pattern)
2. Why is DBSCAN Important?
Unlike traditional clustering algorithms such as K-Means, DBSCAN:
✔ Does NOT require the number of clusters in advance
✔ Can detect non-linear (non-convex) cluster shapes
✔ Automatically detects noise
✔ Performs well when data contains irregular distributions
3. Core Concept Behind DBSCAN
DBSCAN is based on one simple idea:
"Clusters are dense regions of data points separated by low-density regions."
Here, density means the number of data points present within a defined neighborhood (radius) around a particular point.
It uses two important parameters:
1. Epsilon (ε)
- Radius of the neighborhood around a data point.
- Defines how close points must be to be considered neighbors.
2. MinPts (Minimum Points)
- Minimum number of points required inside the ε-radius to form a dense region (the point itself is usually counted).
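To make the two parameters concrete, here is a minimal sketch in plain Python. The helper name neighbors_within, the sample points, and the values eps = 0.5 and min_pts = 4 are illustrative assumptions, not values from this article. The point itself is counted as part of its own neighborhood, which is a common convention.

```python
import math

def neighbors_within(points, index, eps):
    """Return indices of all points within eps of points[index] (itself included)."""
    center = points[index]
    return [j for j, p in enumerate(points) if math.dist(center, p) <= eps]

# Hypothetical 2-D sample: four points close together and one far away.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8), (5.0, 5.0)]
eps = 0.5     # neighborhood radius (illustrative value)
min_pts = 4   # minimum neighborhood size for a dense region (illustrative value)

for i in range(len(points)):
    count = len(neighbors_within(points, i, eps))
    status = "dense" if count >= min_pts else "not dense"
    print(f"point {i}: {count} point(s) within eps -> {status}")
```

With these values, only points 0 and 1 have four points inside their ε-radius. Raising eps or lowering min_pts makes more neighborhoods count as dense.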
4. Key Terminologies in DBSCAN
There are 3 types of points:
1. Core Point
A point that has at least MinPts points (including itself) within its ε-radius.
2. Border Point
A point that has fewer than MinPts points within its ε-radius but lies within ε of a core point.
3. Noise (Outlier)
A point that is neither a core point nor a border point.
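The three point types can be sketched as a small classification routine. This is an illustrative brute-force version, not a library API: the function name classify_points and the sample data are assumptions, and each point is counted in its own neighborhood.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    n = len(points)
    # Neighborhood of every point, with the point itself included.
    neighborhoods = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighborhoods[i]):
            labels.append("border")   # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical sample data and parameter values.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8), (5.0, 5.0)]
print(classify_points(points, eps=0.5, min_pts=4))
# -> ['core', 'core', 'border', 'border', 'noise']
```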
5. Visual Understanding of DBSCAN
From the images above, you can observe:
· Clusters are not circular
· Some points are labeled as noise
· Arbitrary-shaped clusters (clusters that can take any irregular form instead of fixed shapes like circles) are formed
6. How DBSCAN Works (Step-by-Step)
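Step 1: Select an unvisited point.
Step 2: Find all points within ε distance of it.
Step 3: If neighbors ≥ MinPts → mark it as a Core Point and start a new cluster.
Step 4: Expand the cluster recursively by checking neighbors of neighbors; every core point reached this way adds its own neighbors to the cluster.
Step 5: If neighbors < MinPts → mark the point as Noise (temporarily); it becomes a border point if a later cluster expansion reaches it.
Step 6: Repeat until all points are processed.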
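The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the neighbor search is brute force (quadratic in the number of points), the expansion uses a queue instead of literal recursion, and the names dbscan, region_query, NOISE and UNVISITED are hypothetical.

```python
import math
from collections import deque

NOISE = -1        # label for noise points
UNVISITED = None  # marker for points not processed yet

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ... for clusters, -1 for noise)."""
    n = len(points)
    labels = [UNVISITED] * n

    def region_query(i):
        # Step 2: all points within eps of point i (itself included).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(n):                      # Steps 1 and 6: visit every point once
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:        # Step 5: provisional noise
            labels[i] = NOISE
            continue
        cluster_id += 1                     # Step 3: core point starts a cluster
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:                        # Step 4: expand through neighbors
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # former noise becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts: # only core points keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical data: two tight squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
# -> [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A queue is used here only to avoid deep recursion on large clusters; the set of points reached is the same as with the recursive description.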
7. Logic Behind the Algorithm
The main logic is:
🔹 Density Reachability
A point A is directly density-reachable from B if:
· B is a core point
· A lies within ε of B
A is density-reachable from B if a chain of such direct steps leads from B to A.
🔹 Density Connectivity
Two points are density-connected if there exists a point from which both are density-reachable.
This recursive connectivity builds clusters.
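The two conditions listed above (B is a core point, A lies within ε of B) can be checked with a few lines of Python. The sketch below uses hypothetical helper names and sample data; it also shows that this one-step relation is not symmetric, because a border point can be reached from a core point but not the other way round.

```python
import math

def is_core(points, i, eps, min_pts):
    """True if point i has at least min_pts points (itself included) within eps."""
    return sum(math.dist(points[i], p) <= eps for p in points) >= min_pts

def directly_reachable(points, a, b, eps, min_pts):
    """True if point a can be reached from point b in one step:
    b must be a core point and a must lie within eps of b."""
    return is_core(points, b, eps, min_pts) and math.dist(points[a], points[b]) <= eps

# Hypothetical sample: point 0 is a core point, point 2 is a border point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8), (5.0, 5.0)]
eps, min_pts = 0.5, 4

print(directly_reachable(points, 2, 0, eps, min_pts))  # -> True  (0 is core)
print(directly_reachable(points, 0, 2, eps, min_pts))  # -> False (2 is not core)
```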
8. Distance Metrics Used in DBSCAN
DBSCAN commonly uses:
· Euclidean Distance
· Manhattan Distance
· Minkowski Distance
· Haversine Distance (for geo data)
Distance selection depends on:
· Nature of data
· Feature scaling
· Dimensionality
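For reference, the four distances listed above can be written in a few lines of plain Python. The function names are illustrative, and the haversine version assumes a spherical Earth with a mean radius of 6371 km, which is an approximation.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)   # Minkowski with p = 2

def manhattan(a, b):
    return minkowski(a, b, 1)   # Minkowski with p = 1

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

print(euclidean((0, 0), (3, 4)))    # straight-line distance
print(manhattan((0, 0), (3, 4)))    # sum of absolute coordinate differences
print(haversine_km(0, 0, 0, 90))    # a quarter of the equator
```

Because ε is expressed in the units of the chosen distance, switching the metric usually means choosing ε again.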
9. When to Use DBSCAN?
Use DBSCAN when:
✔ You don’t know the number of clusters
✔ Clusters are irregular in shape
✔ Data contains noise/outliers
✔ You are working with spatial/geographical data
✔ The task is fraud detection
✔ The task is anomaly detection
Avoid DBSCAN when:
· The dataset is very high-dimensional (distances become less informative)
· Clusters have widely varying densities (a single ε cannot fit all of them)
Real-Life Applications of DBSCAN
📍 1. Geographical Data Clustering
· Grouping houses in a city
· Earthquake hotspot detection
· Traffic analysis
📊 2. Fraud Detection
· Detect abnormal banking transactions
📡 3. Image Processing
· Object segmentation
🧬 4. Medical Data
· Gene clustering
· Disease pattern detection
🚗 5. Autonomous Vehicles
· Obstacle detection
Problem Statement (for DBSCAN on the Iris dataset):
To apply the DBSCAN clustering algorithm to the Iris flower dataset, using the Sepal and Petal measurements, in order to identify natural density-based clusters without predefining the number of clusters.
To evaluate whether DBSCAN can effectively distinguish the different Iris species and detect any noise or overlapping patterns based on feature similarity.
Download Dataset - Iris Dataset
Dataset Description
• The dataset contains measurements of Iris flowers.
• It includes four numerical features: Sepal Length, Sepal Width, Petal Length, and Petal Width.
• All measurements are recorded in centimeters.
• Each row in the dataset represents one individual flower sample.
• The dataset also includes one categorical variable called Species.
• The Species variable has three classes: setosa, versicolor, and virginica.
• The numerical features describe the morphological (physical) characteristics of the flowers.
• The dataset is widely used for clustering and classification analysis.
• It helps analyze how different flower species can be grouped based on similarity in their measurements.
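As a starting point for the problem statement, the following sketch runs DBSCAN on the Iris data with scikit-learn. It assumes scikit-learn is installed and uses the copy of the dataset bundled with the library (load_iris) rather than a downloaded file. The values eps=0.6 and min_samples=5 are untuned starting guesses, and standardizing the features first is a design choice, not something the dataset requires.

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Put the four measurements on a comparable scale before measuring distances.
X = StandardScaler().fit_transform(iris.data)

# eps and min_samples are illustrative starting values, not tuned results.
model = DBSCAN(eps=0.6, min_samples=5)
labels = model.fit_predict(X)   # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("clusters found:", n_clusters)
print("noise points:", n_noise)

# Compare each cluster with the known species (used only for evaluation).
for cluster_id in sorted(set(labels)):
    names = iris.target_names[iris.target[labels == cluster_id]]
    print(cluster_id, dict(Counter(str(name) for name in names)))
```

The species labels are not used for fitting; they only show how the density-based clusters line up with setosa, versicolor, and virginica. Changing eps or min_samples can change both the number of clusters and the number of noise points.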
Go to Jupyter Notebook to use DBSCAN Algorithm