Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised machine learning technique for dimensionality reduction. It reduces the number of features in a dataset while retaining as much of the original variance (information) as possible.


Key Points About PCA:

Unsupervised Learning: PCA does not require labeled data. It simply analyzes the structure of the data to identify the directions (principal components) that capture the most variance.


Dimensionality Reduction: PCA transforms the original features into a new set of features (principal components), which are linear combinations of the original features. These new features are ordered by the amount of variance they capture from the data.


Purpose: PCA is often used to reduce the complexity of data and to make visualization easier. It can also improve the performance of other machine learning algorithms by discarding low-variance directions, which often carry noise, and by removing redundancy among correlated features.
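
The linear combinations and variance ordering described above can be written compactly. The notation below is an assumption introduced for illustration (it does not appear in the original notes): X is the n x p data matrix with centered columns, S is its sample covariance matrix, and each w_k is a unit-length eigenvector of S.

```latex
% X: n x p matrix with centered (here also standardized) columns
S = \frac{1}{n-1} X^{\top} X, \qquad
S\,w_k = \lambda_k w_k, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

% Scores of the k-th principal component and the variance they capture
t_k = X w_k, \qquad
\operatorname{Var}(t_k) = \lambda_k, \qquad
\text{fraction of variance explained by component } k
  = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}
```

The entries of w_k are the weights of the original features in component k, and the eigenvalues give the ordering by variance.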

Problem Statement for PCA on University Data

The goal of this study is to apply Principal Component Analysis (PCA) to a dataset of university performance metrics to identify the most significant factors that explain the variance in university performance. By reducing the dimensionality of the dataset, we aim to simplify the analysis while retaining the most critical information.


Dataset Description:

The dataset includes various metrics for different universities, such as SAT scores, the percentage of students in the top 10% of their high school class, acceptance rate, student-faculty ratio, expenses, and graduation rate.




Univ: The name of the university (Categorical, used for identification, not included in PCA).

SAT: Average SAT score of admitted students.

Top10: Percentage of students in the top 10% of their high school class.

Accept: Acceptance rate (percentage of applicants admitted).

SFRatio: Student-faculty ratio.

Expenses: Annual expenses per student.

GradRate: Graduation rate (percentage of students who graduate).



Data Standardization: 

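Before applying PCA, standardize the continuous variables so that each one contributes equally to the analysis. The variables here are on very different scales (SAT scores, percentages, expenses per student), and without standardization the variable with the largest numeric variance would dominate the first components.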

PCA Application: 

Perform PCA on the standardized data to extract the principal components.

Variance Explanation: 

Analyze the explained variance for each principal component to determine the number of components to retain.

Component Interpretation:

Interpret the principal components to understand the combination of original variables they represent.
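
The four steps above can be sketched in plain Python as follows. This is a minimal illustration and not the study's actual code: the rows are made-up placeholder values rather than the real university data, the function names are hypothetical, and only the first principal component is extracted, using power iteration in place of a library routine. In practice a library such as NumPy or scikit-learn would normally do this work.

```python
import math
import statistics

COLUMNS = ["SAT", "Top10", "Accept", "SFRatio", "Expenses", "GradRate"]

# Made-up placeholder rows (NOT the real university data), one row per university.
ROWS = [
    [1400, 90, 20, 8, 40000, 95],
    [1300, 75, 35, 11, 25000, 88],
    [1250, 60, 45, 13, 18000, 82],
    [1100, 35, 65, 17, 10000, 70],
    [1050, 28, 75, 19, 9000, 66],
    [1350, 85, 30, 10, 30000, 92],
    [1200, 50, 55, 15, 14000, 78],
]


def standardize(rows):
    """Step 1: z-score each column (sample standard deviation, n - 1)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def covariance_matrix(z):
    """Sample covariance of centered data; for z-scores this is the correlation matrix."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=500):
    """Step 2: power iteration for the largest eigenvalue and its unit eigenvector."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # assumes this start is not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return eigenvalue, v


z = standardize(ROWS)
cov = covariance_matrix(z)
eigenvalue, loadings = first_component(cov)

# Step 3: the trace of a correlation matrix equals the number of variables,
# so the explained share of component 1 is eigenvalue / p.
explained = eigenvalue / len(COLUMNS)
print(f"PC1 explains {explained:.1%} of the total variance")

# Step 4: the loadings show how each original variable enters the component.
for name, weight in zip(COLUMNS, loadings):
    print(f"{name:>9}: {weight:+.3f}")

# Component scores: each university projected onto PC1.
scores = [sum(w * x for w, x in zip(loadings, row)) for row in z]
```

Variables whose loadings share a sign move together along the component, while opposite signs indicate a trade-off. The overall sign of a component is arbitrary, so only the relative signs and magnitudes carry meaning.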


Data Standardization:

The line of code uni_normal = scale(UNI) is commonly used in data preprocessing for standardization. Here's a detailed explanation of what this code does:



UNI: This represents a dataset or variable, typically a numeric matrix or data frame.

scale(): A standardization function. Base R provides it as scale(); in Python, scikit-learn offers a counterpart, sklearn.preprocessing.scale.


Standardization: The scale() function standardizes the data by transforming each feature to have a mean of 0 and a standard deviation of 1. This process is also known as z-score normalization. It matters for scale-sensitive methods such as PCA, where features measured in large units would otherwise dominate the variance.


How It Works:


Mean Calculation: The function calculates the mean of each feature (column) in the dataset UNI.

Standard Deviation Calculation: It calculates the standard deviation of each feature.

Transformation: Each value in the dataset is then transformed using the formula:

    z = \frac{x - \text{mean}}{\text{std\_dev}}

    where

    x is the original value,

    mean is the mean of the feature, and

    std_dev is the standard deviation of the feature.

Result: The result, uni_normal, is a new dataset where each feature has been standardized. The values now represent the number of standard deviations away from the mean.
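
The mean, standard deviation, and transformation steps above can be traced on a single feature with a small sketch. The function name z_scores, its sample flag, and the input values are hypothetical and used only for illustration. The flag exists because libraries differ on the denominator: R's scale() uses the sample standard deviation (n - 1), while scikit-learn's scale uses the population standard deviation (n).

```python
import statistics


def z_scores(values, sample=True):
    """Standardize one feature: subtract the mean, divide by the standard deviation.

    sample=True divides by the sample standard deviation (n - 1 denominator);
    sample=False divides by the population standard deviation (n denominator).
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(x - mean) / sd for x in values]


# Made-up values for one feature: mean 5, population standard deviation 2.
feature = [2, 4, 4, 4, 5, 5, 7, 9]

z_pop = z_scores(feature, sample=False)
z_samp = z_scores(feature, sample=True)

print(z_pop)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Both variants center the feature at 0. They differ only by a constant factor, which becomes negligible as the number of rows grows.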

