Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a type of unsupervised learning in machine learning. Specifically, it is a dimensionality reduction technique used to reduce the number of features in a dataset while retaining as much of the original variability (information) as possible.
Key Points About PCA:
Unsupervised Learning: PCA does not require labeled data. It simply analyzes the structure of the data to identify the directions (principal components) that capture the most variance.
Dimensionality Reduction: PCA transforms the original features into a new set of features (principal components), which are linear combinations of the original features. These new features are ordered by the amount of variance they capture from the data.
Purpose: PCA is often used to reduce the complexity of data, make visualization easier, and improve the performance of other machine learning algorithms by removing noise and redundancy.
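To make these points concrete, here is a minimal sketch of PCA in Python with scikit-learn. The synthetic matrix X and the choice to keep two components are assumptions for illustration only, not part of the article's dataset.

# Minimal PCA sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                            # 100 samples, 6 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # make two features redundant

X_std = StandardScaler().fit_transform(X)  # standardize: mean 0, std 1 per column
pca = PCA(n_components=2)                  # keep the top two components
scores = pca.fit_transform(X_std)

print(scores.shape)                        # (100, 2) -> reduced representation
print(pca.explained_variance_ratio_)       # share of variance per component

Because the second feature is nearly a copy of the first, the first component alone captures most of the variance, which is exactly the redundancy-removal effect described above.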
Problem Statement for PCA on University Data
Title:
Dimensionality Reduction in University Performance Metrics Using Principal Component Analysis
Objective:
The goal of this study is to apply Principal Component Analysis (PCA) to a dataset of university performance metrics to identify the most significant factors that explain the variance in university performance. By reducing the dimensionality of the dataset, we aim to simplify the analysis while retaining the most critical information.
Dataset Description:
The dataset includes various metrics for different universities, such as SAT scores, the percentage of students in the top 10% of their high school class, acceptance rate, student-faculty ratio, expenses, and graduation rate.
Variables:
Univ: The name of the university (categorical, used for identification, not included in PCA).
SAT: Average SAT score of admitted students.
Top10: Percentage of students in the top 10% of their high school class.
Accept: Acceptance rate (percentage of applicants admitted).
SFRatio: Student-faculty ratio.
Expenses: Annual expenses per student.
GradRate: Graduation rate (percentage of students who graduate).
Methodology:
Data Standardization: Before applying PCA, standardize the continuous variables to ensure that each variable contributes equally to the analysis.
PCA Application: Perform PCA on the standardized data to extract the principal components.
Variance Explanation: Analyze the explained variance for each principal component to determine the number of components to retain.
Component Interpretation: Interpret the principal components to understand the combination of original variables they represent. These four steps are sketched in code below.
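The following is a sketch of the four methodology steps in Python, under two assumptions: the file name universities.csv is a placeholder for the dataset linked at the end of the article, and the column names follow the variable list above.

# Sketch of the methodology; "universities.csv" is a placeholder file name.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

uni = pd.read_csv("universities.csv")

# 1. Standardize the continuous variables (Univ is an identifier and is excluded).
features = ["SAT", "Top10", "Accept", "SFRatio", "Expenses", "GradRate"]
uni_normal = scale(uni[features])

# 2. Extract all principal components from the standardized data.
pca = PCA()
scores = pca.fit_transform(uni_normal)

# 3. Explained variance per component, used to decide how many to retain.
print(pca.explained_variance_ratio_.cumsum())

# 4. Loadings: how each component combines the original variables.
loadings = pd.DataFrame(pca.components_, columns=features)
print(loadings.round(2))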
Data Standardization:
The line of code uni_normal = scale(UNI) is commonly used in data preprocessing for standardization. Here's a detailed explanation of what this code does:
Context
UNI: This represents a dataset or variable, typically a numeric matrix or data frame.
scale(): This is a function for standardizing data, available in several data analysis environments (such as base R, or sklearn in Python).
Explanation
Standardization: The scale() function standardizes the data by transforming it to have a mean of 0 and a standard deviation of 1. This process is also known as z-score normalization. Standardization is useful for many machine learning algorithms, which assume that the data is centered around 0 and scaled.
How It Works:
Mean Calculation: The function calculates the mean of each feature (column) in the dataset UNI.
Standard Deviation Calculation: It calculates the standard deviation of each feature.
Transformation: Each value in the dataset is then transformed using the formula:

z = (x - mean) / std_dev

where x is the original value, mean is the mean of the feature, and std_dev is the standard deviation of the feature.
Result: The result, uni_normal, is a new dataset where each feature has been standardized. The values now represent the number of standard deviations away from the mean.
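As a sanity check, the same transformation can be reproduced by hand and compared against the library call. The small matrix below is a made-up example, and sklearn's scale() stands in for the line discussed above.

# Reproducing scale() by hand to confirm z = (x - mean) / std_dev, column by column.
import numpy as np
from sklearn.preprocessing import scale

UNI = np.array([[1200.0, 75.0],   # made-up example matrix
                [1350.0, 90.0],
                [1100.0, 60.0]])

manual = (UNI - UNI.mean(axis=0)) / UNI.std(axis=0)  # column-wise z-scores
uni_normal = scale(UNI)                              # library equivalent

print(np.allclose(manual, uni_normal))    # True
print(uni_normal.mean(axis=0).round(6))   # ~0 for each column
print(uni_normal.std(axis=0).round(6))    # 1 for each column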
Download Dataset from Here
PCA: Jupyter Notebook
Please share this article with your loved ones, and do leave your feedback. 🙏 🙏