Basics, Python

Using K-modes for clustering categorical data

Introduction to K-Modes Algorithm

Clustering or (dividing into similar subgroups) forms a crucial part of data analysis.Dividing the entire data set into various similar subgroups helps us to gain a lot of insight from the data. All those who are in the field of analytics or trying to get into it must have heard about “K-means Algorithm”.It is one of the most well known and widely used  “unsupervised-learning Algorithms”  till date.In one of my article, I have used K-means to find the optimal nodal centers for Radial Basis function.In this article, we will look into a similar type of algorithm called “K-modes Algorithm” which uses modes instead of means to form clusters of categorical data.Let’s dive into the steps of the algorithm.
K-modes Algorithm
                     Step 1 –  Randomly assign “K” number of modes.(select initial “k” number of random data points as modes)
                     Step 2 –  Calculate the dissimilarity score between each of the remaining data points from the “K” number of chosen modes.
                     Step 3 –  Associate the data points to the mode whose score is minimum.(you will have K number of clusters)
                     Step 4 –  Use ‘Moving mode frequency based method’ to update the modes (for each of the k clusters we need to update the modes).
                     Step 5 –  Repeat from step 2 until there is no reassignment of clusters or when cost function is minimized.
There are certain terms like dissimilarity score, Moving mode frequency based method and cost function which need to be explained in a bit detail before we jump to the coding part.Mode, as you must know, means having highest frequency, so basically, the entire algorithm for K-modes is built upon using the highest frequency to form the clusters.
      formula
formula
 PYTHON CODE

In [2]:

import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
%matplotlib inline
In [3]:
dat = pd.read_csv("car.csv")#Using the car evaluation dataset

Its a labeled data set but we will use it for clustering by removing the class label

In [4]:
dat.head()#the first five observations
Out[4]:
buyingmaintdoorspersonslug_bootsafetyclass
0vhighvhigh22smalllowunacc
1vhighvhigh22smallmedunacc
2vhighvhigh22smallhighunacc
3vhighvhigh22medlowunacc
4vhighvhigh22medmedunacc
In [30]:
np.unique(dat['class'])#The data has 4 labels
Out[30]:
array(['acc', 'good', 'unacc', 'vgood'], dtype=object)
In [5]:
dat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying      1728 non-null object
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
In [6]:
dat.describe()
Out[6]:
buyingmaintdoorspersonslug_bootsafetyclass
count1728172817281728172817281728
unique4443334
topmedmed3moremedmedunacc
freq4324324325765765761210
In [7]:
#function to calculate the similarity score between two categorical objects
def similarity_score(x,y):
    count = 0
    for i in range(len(x)):
        if x[i]==y[i]:
            count +=1
    return count

Writing a function called K-modes it will return a dictionary containing the clustered dataset along with modes,individual cost function and total cost function

In [8]:
def kmodes(data,clusters,iterations):
    index=np.random.permutation(data.index)[:clusters]#randomly selecting index values for initial k modes
    modes=[]
    labels=[]
    dis=[]
    J=[]
    d=dict()
#calculating the dissimilarity score between each categorical object from the modes and assigning label to minimum score
    for i in index:
        modes.append(np.array(data.ix[i]))#inital k modes added 
    for i in range(len(data)):
        dis=[]
        for j in range(len(modes)):
            dis.append(similarity_score(np.array(data.ix[i]),modes[j]))
        labels.append(np.argmin(dis))
    data['label']=np.array(labels)
#using frequency based method to update the weights
    for i in range(iterations):
        modes=[]
        labels=[]
        for i in np.unique(data['label']):
            modes.append(np.array(data.ix[data['label']==i,:-1].describe().ix[2,:]))
#calculating the diss score between each categorical object and updated modes and assigning new label to minimum score
        for i in range(len(data)):
            dis=[]
            for j in range(len(modes)):
                dis.append(similarity_score(np.array(data.ix[i,:-1]),modes[j]))
            labels.append(np.argmin(dis))
        data['label']=np.array(labels)
#calculating the cost function of individual cluster
    for i in range(len(modes)):
        dis=[]
        for j in range(len(data.ix[data['label']==i,:-1])):
            dis.append(similarity_score(np.array(data.ix[data['label']==i,:-1].iloc[j]),modes[i]))
        J.append(sum(dis))
    d['modes']=modes
    d['costfunction']=J
    d['totalcost']=sum(J)
    d['data']=data
    return d
In [9]:
dic = kmodes(dat.ix[:,:-1],4,2)#calling kmodes on the cars dataset with k equal to 4 as the original had 4 labels. 
In [10]:
dic['costfunction']#individual cost function of 4 clusters
Out[10]:
[612, 232, 212, 160]
In [11]:
dic['totalcost']#total cost function
Out[11]:
1216
In [14]:
cldata = dic['data']#lets see how the clusterd data looks like

The first five observations of each class label

In [26]:
cldata.ix[cldata['label']==0,:].head()
Out[26]:
buyingmaintdoorspersonslug_bootsafetylabel
1vhighvhigh22smallmed0
2vhighvhigh22smallhigh0
4vhighvhigh22medmed0
5vhighvhigh22medhigh0
7vhighvhigh22bigmed0
In [27]:
cldata.ix[cldata['label']==1,:].head()
Out[27]:
buyingmaintdoorspersonslug_bootsafetylabel
0vhighvhigh22smalllow1
3vhighvhigh22medlow1
6vhighvhigh22biglow1
9vhighvhigh24smalllow1
12vhighvhigh24medlow1
In [28]:
cldata.ix[cldata['label']==2,:].head()
Out[28]:
buyingmaintdoorspersonslug_bootsafetylabel
30vhighvhigh32medlow2
32vhighvhigh32medhigh2
33vhighvhigh32biglow2
35vhighvhigh32bighigh2
39vhighvhigh34medlow2
In [29]:
cldata.ix[cldata['label']==3,:].head()
Out[29]:
buyingmaintdoorspersonslug_bootsafetylabel
21vhighvhigh2moremedlow3
22vhighvhigh2moremedmed3
31vhighvhigh32medmed3
34vhighvhigh32bigmed3
45vhighvhigh3moresmalllow3

 

Tagged , ,