
Mathematics of Principal Component Analysis (PCA)

Understanding Principal Component Analysis

In this part of the article, I will explain the mathematics and intuition behind Principal Component Analysis (PCA); in the next part, I will show how to implement PCA using Python. PCA is an unsupervised machine learning technique that creates a low-dimensional representation of a dataset. It mathematically transforms the variables in the dataset into principal components that capture the maximum variance in the data. These principal components are linear combinations of the original variables and are mutually uncorrelated. So, if you have a dataset with hundreds of variables, many of them correlated, PCA can turn it into a dataset with fewer variables that retain most of the variance and have no multicollinearity between them. Below are some advantages of PCA:

  • It is used as a pre-processing step before a supervised machine learning technique is applied (see the sketch after this list)
  • It can be used as a dimensionality reduction technique
  • It can be used to find the directions of maximum variance in the data
  • It can be used to remove multicollinearity from the data
  • It can also be used for data visualization
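As a quick illustration of the first two points, here is a minimal sketch of PCA used as a pre-processing and dimensionality-reduction step before a classifier. It assumes scikit-learn is available; the dataset and the model are placeholders, not anything from this article.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Placeholder data: 200 observations of 50 correlated variables (noisy copies
# of a few hidden factors), plus a made-up binary target.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))
X = factors @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(200, 50))
y = (factors[:, 0] > 0).astype(int)

# Standardise, keep just enough components to explain ~95% of the variance,
# then fit a classifier on the resulting uncorrelated components.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression())
model.fit(X, y)
print(model.named_steps["pca"].n_components_)   # number of components actually kept
```

Passing a float to `n_components` tells scikit-learn's PCA to keep only as many components as are needed to explain that fraction of the variance, which is exactly the dimensionality-reduction use case described above.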

When we perform PCA on a dataset, it creates at most min(m-1, n) principal components, where m is the number of observations in the dataset and n is the number of variables. The first principal component always has the largest variance, followed by the second principal component, and so on. A quick numerical check of this count is sketched below.
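This is a minimal sketch of that check, assuming NumPy: centring the data removes one degree of freedom, so a centred 5 x 10 matrix has at most min(m-1, n) = 4 components with non-zero variance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 10                      # 5 observations, 10 variables
X = rng.normal(size=(m, n))

# Centring removes one degree of freedom, so the centred matrix has
# rank at most min(m - 1, n) -- one non-zero singular value per component.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
print(np.sum(singular_values > 1e-10))   # 4 == min(m - 1, n)
```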

But how are those principal components created? To answer this question, let us go back to our school days, when we learned about rotation of axes. Consider the point (3, 4) in the x-y coordinate system: what will its new coordinates be if we rotate the axes by a 45° angle?

When we rotate the axes by an angle θ (here 45°), the new coordinates are given by a linear combination of the old coordinates:

PC_1 = X_1 \cos\theta + X_2 \sin\theta

PC_2 = -X_1 \sin\theta + X_2 \cos\theta
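For the point (3, 4) and θ = 45°, a minimal sketch of the computation (assuming NumPy) gives the new coordinates of roughly (4.95, 0.71):

```python
import numpy as np

theta = np.deg2rad(45)            # rotation angle
x1, x2 = 3, 4                     # original coordinates

pc1 =  x1 * np.cos(theta) + x2 * np.sin(theta)   # coordinate along the rotated first axis
pc2 = -x1 * np.sin(theta) + x2 * np.cos(theta)   # coordinate along the rotated second axis
print(round(pc1, 3), round(pc2, 3))              # 4.95 0.707
```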

Something similar happens under the hood when we perform PCA on a dataset: it creates principal components as linear combinations of the original variables, chosen so that these new variables capture the maximum possible variability in the data. Let's make this clearer by visualizing a dataset. The following plot shows a scatter plot of two predictors, X1 and X2.

It is clear that the variables are strongly associated (correlated) with each other, and that most of the variance in the data is explained by the predictor X1, followed by X2. Now let's see what we get after performing PCA on the above data.

We see that after PCA almost all the variance in the data is explained by the first principal component, PC1, thereby reducing the number of important predictor variables from 2 to 1. The variance explained by PC1 is more than what was explained by the predictor X1. Also, there is no association between the principal components PC1 and PC2, i.e. we have also removed the collinearity between the predictor variables. So, by now you must have understood that PCA is a very powerful technique for pre-processing data.
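Since the plotted dataset itself is not reproduced here, the following is a minimal sketch on synthetic, strongly correlated data (assuming NumPy and scikit-learn) that shows both effects: PC1 capturing almost all the variance, and the PC scores being uncorrelated.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the plotted data: X2 is a noisy copy of X1.
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)                    # columns are the PC1 and PC2 scores

print(pca.explained_variance_ratio_)             # PC1 captures almost all of the variance
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])   # ~0: PC1 and PC2 are uncorrelated
```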

Let’s now try to understand the mathematics of computing the principal components. As I have already explained, principal components are calculated as linear combinations of the predictor variables. The following is the formula for the first principal component, PC1:

PC_1 = \phi_{11} X_1 + \phi_{21} X_2 + \dots + \phi_{n1} X_n

 

We normalize the loadings by imposing the following constraint; without it, the loadings could be scaled up arbitrarily, which would result in an arbitrarily large variance:

\sum_{i=1}^{n} \phi_{i1}^{2} = 1
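As a small numerical illustration of why this constraint is needed (a sketch assuming NumPy, with an arbitrary unit-length loading vector): scaling the loadings by a factor c scales the variance of the resulting scores by c², so without the unit-norm constraint the "maximum variance" would be unbounded.

```python
import numpy as np

# Scaling the loadings by c scales the variance of the scores by c**2,
# which is why the loadings are constrained to have unit length.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))
Xc = X - X.mean(axis=0)                   # centre the two variables
phi = np.array([0.6, 0.8])                # unit-length loadings: 0.6**2 + 0.8**2 == 1

for c in (1, 10):
    scores = Xc @ (c * phi)
    print(c, round(np.mean(scores**2), 3))   # the second variance is ~100x the first
```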

 

Here the weights \phi_{11}, \phi_{21}, ..., \phi_{n1} are known as the loadings of the first principal component, or, collected into a vector, the loading vector of the first principal component. The loadings give the direction in feature space along which the data varies the most (like cosθ and sinθ in the rotation-of-axes problem above), whereas the projections of the actual data points onto this direction are known as the principal component scores. Let's make this concrete with a dataset. Consider an m x n dataset (i.e. m observations and n variables). The score of the i-th observation on the first principal component can then be written as

PC_{i1} = \phi_{11} X_{i1} + \phi_{21} X_{i2} + \dots + \phi_{n1} X_{in}

 

As we are only interested in maximizing the variance, we assume that the mean of each variable is zero (i.e. the data is centred), just to make the computation simpler. The variance of the first principal component can then be written as

\frac{1}{m} \sum_{j=1}^{m} PC_{j1}^{2}
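The following minimal sketch (assuming NumPy, with a small made-up dataset and a hypothetical unit-length loading vector) computes the PC1 scores from the score formula above and their variance as the mean of the squared scores:

```python
import numpy as np

# Hypothetical unit-length loading vector for the first principal component.
phi1 = np.array([0.8, 0.6])              # 0.8**2 + 0.6**2 == 1

# Small made-up dataset: m = 4 observations, n = 2 variables.
X = np.array([[2.0, 1.0],
              [0.5, 0.3],
              [1.5, 1.2],
              [3.0, 2.1]])
Xc = X - X.mean(axis=0)                  # centre each variable (zero mean)

pc1_scores = Xc @ phi1                   # PC_i1 = phi_11 * X_i1 + phi_21 * X_i2
variance = np.mean(pc1_scores ** 2)      # (1/m) * sum of squared scores
print(pc1_scores, variance)
```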

 

Now our job is simply to maximize the above variance subject to the constraint on the loadings. This is a standard optimization problem:

Maximize:

\frac{1}{m} \sum_{j=1}^{m} PC_{j1}^{2} = \frac{1}{m} \sum_{j=1}^{m} \left( \sum_{i=1}^{n} \phi_{i1} X_{ji} \right)^{2}

 

Subject to:

\sum_{i=1}^{n} \phi_{i1}^{2} = 1

 

This problem can be solved by the Singular Value Decomposition (SVD) of the matrix of variables X. SVD is a very common matrix factorization technique in linear algebra. I will not explain how SVD works in this article, but we will cover this technique in another article. Now you have seen how PCA finds the directions of maximum variance in the data. But how can PCA remove multicollinearity? To answer this question, let us now construct the second principal component. The second principal component is created in the same way, i.e. as a linear combination of the variables X_1, X_2, ..., X_n, but this time we take the linear combination that is uncorrelated with the first principal component and has maximum variance subject to that condition. Imposing this new constraint gives a loading vector \phi_2 for the second principal component that points in a direction orthogonal to the first loading vector \phi_1. And we know that if two variables vary along orthogonal directions, they are uncorrelated with each other.
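To make this concrete, here is a minimal sketch of PCA via SVD (assuming NumPy and using random placeholder data): the right singular vectors of the centred data matrix serve as the loading vectors, they are mutually orthogonal, and the resulting principal component scores are uncorrelated.

```python
import numpy as np

# PCA via SVD of the centred data matrix (random placeholder data).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                   # column k is the loading vector phi_k (unit length)
scores = Xc @ loadings            # principal component scores

print(np.allclose(loadings.T @ loadings, np.eye(3)))      # loading vectors are orthonormal
print(np.round(np.cov(scores, rowvar=False), 3))          # off-diagonals ~0: PCs uncorrelated
```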

I hope this article has helped you understand the mathematics of principal component analysis. Let me know your thoughts in the comments section below. Stay tuned for the next part of the article, where I will explain how to implement PCA using Python. Happy learning.
