
Implementing Principal Component Analysis using Python

This article continues my previous article on the Mathematics of Principal Component Analysis (PCA); I recommend reading that one before this one. In this post, I will explain how to implement PCA using Python. I have taken the wholesale customers dataset from the UCI Machine Learning Repository. This dataset refers to clients of a wholesale distributor and includes their annual spending in monetary units (m.u.) on diverse product categories. We will apply PCA to this dataset to find out which product categories are related to each other based on the spending habits of the clients. Let’s load the required libraries and the dataset.

# loading libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# loading the data

data = pd.read_csv("customer.csv")
data = data.drop(["Channel", "Region"], axis=1)

data summary
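One way to produce a summary like this is the pandas describe method. The sketch below is only an illustration: since customer.csv is not available here, it uses a hypothetical, randomly generated stand-in (demo) with the same six column names, so the numbers it prints are not the real ones.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customer.csv: synthetic, right-skewed values
# generated only so that this sketch runs on its own.
rng = np.random.default_rng(0)
columns = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
demo = pd.DataFrame(rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6)), columns=columns)

# count, mean, std, min, quartiles and max for every column
summary = demo.describe()
print(summary.round(1))
```

With the real data, calling data.describe() after the loading step should give the corresponding table.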

Now we will do some exploratory data analysis (EDA) on the above dataset.

# producing a scatter matrix for each pair of features in the data

# note: scatter_matrix lives in pd.plotting in current pandas versions
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,10), diagonal='kde');

correlation matrix

From the above scatter plots, there seems to be a linear relationship between spending on milk, grocery and detergents_paper items. There might also be a linear relationship between spending on fresh and frozen products. Let’s now analyze the dataset by creating six principal components.
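The visual impression from the scatter plots can be backed up with numbers by computing pairwise Pearson correlations. The snippet below is a minimal sketch on made-up data: demo is a hypothetical frame in which Milk and Grocery are built to move together and Fresh is independent, purely to show what the call returns.

```python
import numpy as np
import pandas as pd

# Synthetic illustration only: Milk and Grocery share a common driver,
# Fresh does not. These are not the real customer figures.
rng = np.random.default_rng(1)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "Milk": base + 0.3 * rng.normal(size=300),
    "Grocery": base + 0.3 * rng.normal(size=300),
    "Fresh": rng.normal(size=300),
})

# Pairwise Pearson correlation between all columns
corr = demo.corr()
print(corr.round(2))
```

On the real dataset, data.corr() would give the equivalent matrix; values close to 1 between two categories would support the linear relationship suggested by the plots.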

# scaling the data before PCA

from sklearn.preprocessing import scale
data = pd.DataFrame(scale(data), columns=['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen'])
# implementing PCA

from sklearn.decomposition import PCA

pca = PCA(n_components=6).fit(data)
pca_samples = pca.transform(data)

# helper that tabulates and plots the feature weights and explained variance of each component
def pca_results(data, pca):
    # Dimension indexing
    dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)]
    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns = data.keys()) 
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1) 
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance']) 
    variance_ratios.index = dimensions

    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize = (14,8))

    # Plot the feature weights as a function of the components
    components.plot(ax = ax, kind = 'bar')
    ax.set_ylabel("Feature Weights") 
    ax.set_xticklabels(dimensions, rotation=0)

    # Display the explained variance ratios
    for i, ev in enumerate(pca.explained_variance_ratio_): 
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n %.4f"%(ev))

    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis = 1)

# store the table under a new name so the pca_results function is not overwritten
results = pca_results(data, pca)

bar plot

The table below shows the cumulative variance explained by the principal components in the above dataset.


cumulative sum
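A cumulative table of this kind can be obtained by taking a running sum over explained_variance_ratio_. The sketch below assumes a fitted six-component PCA; to keep it self-contained it fits one on hypothetical random data (demo), so its output will not match the figures in this article.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical stand-in for the scaled customer data (synthetic values)
rng = np.random.default_rng(2)
demo = scale(rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6)))

pca_demo = PCA(n_components=6).fit(demo)

# Running total of the variance explained by the first k components
cumulative = pd.DataFrame(
    {"Cumulative Explained Variance": np.cumsum(pca_demo.explained_variance_ratio_)},
    index=["Dimension {}".format(i) for i in range(1, 7)],
)
print(cumulative.round(4))
```

With the pca object fitted earlier, np.cumsum(pca.explained_variance_ratio_) would give the same kind of running total for the real data.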

From the above table, we can see that the first four components together explain 94.19% of the variance in the data. Deciding how many principal components to keep is a very important step in the analysis. To help us choose, we use what is known as a scree plot. Here the plot shows the cumulative explained variance on the y-axis and the number of principal components on the x-axis. As a rule of thumb, we look for an elbow in the plot: the number of components at which the elbow occurs is the number we should select for our analysis. Let’s now draw the scree plot for our case.

From the plot below, we can observe an elbow at the second principal component. Hence we should use only two principal components in our analysis.

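# Explained variance
n_comp = len(pca.explained_variance_ratio_)
plt.plot(range(1, n_comp + 1), np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')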

scree plot
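Reading an elbow off a plot is a judgment call, so it can help to cross-check it against a variance threshold. This is a different rule from the elbow used in this article, shown only for comparison: scikit-learn's PCA accepts a fraction between 0 and 1 as n_components (with svd_solver="full") and keeps as many components as are needed to exceed that share of variance. The 0.90 threshold and the random demo data are assumptions made for the sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical stand-in for the scaled customer data (synthetic values)
rng = np.random.default_rng(4)
demo = scale(rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6)))

# Assumed threshold: keep enough components to explain at least 90% of variance
threshold = 0.90
pca_thr = PCA(n_components=threshold, svd_solver="full").fit(demo)

explained = np.cumsum(pca_thr.explained_variance_ratio_)
print(pca_thr.n_components_, explained.round(4))
```

The number of components actually kept is available as pca_thr.n_components_ after fitting.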

Let’s now redo our analysis with only two principal components and draw the biplot for the above dataset.

# creating a biplot

pca = PCA(n_components=2).fit(data)
reduced_data = pca.transform(data)
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

def biplot(data, reduced_data, pca):
    fig, ax = plt.subplots(figsize = (14,8))
    # scatterplot of the reduced data 
    ax.scatter(x=reduced_data.loc[:, 'Dimension 1'], y=reduced_data.loc[:, 'Dimension 2'], facecolors='b', edgecolors='b', s=70, alpha=0.5)
    feature_vectors = pca.components_.T

    # using scaling factors to make the arrows
    arrow_size, text_pos = 7.0, 8.0

    # projections of the original features
    for i, v in enumerate(feature_vectors):
        ax.arrow(0, 0, arrow_size*v[0], arrow_size*v[1], head_width=0.2, head_length=0.2, linewidth=2, color='red')
        # label each arrow with its feature name, taken from the data passed to the function
        ax.text(v[0]*text_pos, v[1]*text_pos, data.columns[i], color='black', ha='center', va='center', fontsize=18)

    ax.set_xlabel("Dimension 1", fontsize=14)
    ax.set_ylabel("Dimension 2", fontsize=14)
    ax.set_title("PC plane with original feature projections.", fontsize=16);
    return ax

biplot(data, reduced_data, pca)


The biplot above shows that milk, grocery, and detergents_paper are aligned with principal component 1 (dimension 1), whereas fresh and frozen products are aligned with principal component 2 (dimension 2). This is intuitive, as the scatter plots above already suggested a linear relationship among milk, grocery and detergents_paper, and another between fresh and frozen products. Hence principal component analysis reduced the dataset from six variables to two and, in doing so, also removed the multicollinearity in the data by combining the related variables into their respective principal components or dimensions.
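The point about multicollinearity can be checked directly: the scores on different principal components should be uncorrelated even when the original features are not. The following is a minimal sketch on hypothetical data (demo), in which three columns are deliberately built to be strongly correlated; it is not run on the customer dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic illustration: three columns share a common driver,
# three further columns are independent noise.
rng = np.random.default_rng(3)
shared = rng.normal(size=(300, 1))
correlated = shared + 0.2 * rng.normal(size=(300, 3))
independent = rng.normal(size=(300, 3))
demo = scale(np.hstack([correlated, independent]))

# Project onto two principal components
scores = PCA(n_components=2).fit_transform(demo)

# Correlation among the original features versus among the component scores
feature_corr = np.corrcoef(demo, rowvar=False)
score_corr = np.corrcoef(scores, rowvar=False)
print(feature_corr[:3, :3].round(2))
print(score_corr.round(6))
```

The off-diagonal entry of score_corr comes out as zero up to floating-point error, while the first three features remain strongly correlated with each other. The same check could be applied to reduced_data from the biplot step.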