# Using K-modes for clustering categorical data

#### Introduction to K-Modes Algorithm

Clustering or (dividing into similar subgroups) forms a crucial part of data analysis.Dividing the entire data set into various similar subgroups helps us to gain a lot of insight from the data. All those who are in the field of analytics or trying to get into it must have heard about “K-means Algorithm”.It is one of the most well known and widely used  “unsupervised-learning Algorithms”  till date.In one of my article, I have used K-means to find the optimal nodal centers for Radial Basis function.In this article, we will look into a similar type of algorithm called “K-modes Algorithm” which uses modes instead of means to form clusters of categorical data.Let’s dive into the steps of the algorithm.
K-modes Algorithm
Step 1 –  Randomly assign “K” number of modes.(select initial “k” number of random data points as modes)
Step 2 –  Calculate the dissimilarity score between each of the remaining data points from the “K” number of chosen modes.
Step 3 –  Associate the data points to the mode whose score is minimum.(you will have K number of clusters)
Step 4 –  Use ‘Moving mode frequency based method’ to update the modes (for each of the k clusters we need to update the modes).
Step 5 –  Repeat from step 2 until there is no reassignment of clusters or when cost function is minimized.
There are certain terms like dissimilarity score, Moving mode frequency based method and cost function which need to be explained in a bit detail before we jump to the coding part.Mode, as you must know, means having highest frequency, so basically, the entire algorithm for K-modes is built upon using the highest frequency to form the clusters.

PYTHON CODE

In [2]:

```import pandas  as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
%matplotlib inline
```
In [3]:
```dat = pd.read_csv("car.csv")#Using the car evaluation dataset
```

### Its a labeled data set but we will use it for clustering by removing the class label

In [4]:
```dat.head()#the first five observations
```
Out[4]:
0vhighvhigh22smalllowunacc
1vhighvhigh22smallmedunacc
2vhighvhigh22smallhighunacc
3vhighvhigh22medlowunacc
4vhighvhigh22medmedunacc
In [30]:
```np.unique(dat['class'])#The data has 4 labels
```
Out[30]:
`array(['acc', 'good', 'unacc', 'vgood'], dtype=object)`
In [5]:
```dat.info()
```
```<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
maint       1728 non-null object
doors       1728 non-null object
persons     1728 non-null object
lug_boot    1728 non-null object
safety      1728 non-null object
class       1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB
```
In [6]:
```dat.describe()
```
Out[6]:
count1728172817281728172817281728
unique4443334
topmedmed3moremedmedunacc
freq4324324325765765761210
In [7]:
```#function to calculate the similarity score between two categorical objects
def similarity_score(x,y):
count = 0
for i in range(len(x)):
if x[i]==y[i]:
count +=1
return count
```

### Writing a function called K-modes it will return a dictionary containing the clustered dataset along with modes,individual cost function and total cost function

In [8]:
```def kmodes(data,clusters,iterations):
index=np.random.permutation(data.index)[:clusters]#randomly selecting index values for initial k modes
modes=[]
labels=[]
dis=[]
J=[]
d=dict()
#calculating the dissimilarity score between each categorical object from the modes and assigning label to minimum score
for i in index:
for i in range(len(data)):
dis=[]
for j in range(len(modes)):
dis.append(similarity_score(np.array(data.ix[i]),modes[j]))
labels.append(np.argmin(dis))
data['label']=np.array(labels)
#using frequency based method to update the weights
for i in range(iterations):
modes=[]
labels=[]
for i in np.unique(data['label']):
modes.append(np.array(data.ix[data['label']==i,:-1].describe().ix[2,:]))
#calculating the diss score between each categorical object and updated modes and assigning new label to minimum score
for i in range(len(data)):
dis=[]
for j in range(len(modes)):
dis.append(similarity_score(np.array(data.ix[i,:-1]),modes[j]))
labels.append(np.argmin(dis))
data['label']=np.array(labels)
#calculating the cost function of individual cluster
for i in range(len(modes)):
dis=[]
for j in range(len(data.ix[data['label']==i,:-1])):
dis.append(similarity_score(np.array(data.ix[data['label']==i,:-1].iloc[j]),modes[i]))
J.append(sum(dis))
d['modes']=modes
d['costfunction']=J
d['totalcost']=sum(J)
d['data']=data
return d
```
In [9]:
```dic = kmodes(dat.ix[:,:-1],4,2)#calling kmodes on the cars dataset with k equal to 4 as the original had 4 labels.
```
In [10]:
```dic['costfunction']#individual cost function of 4 clusters
```
Out[10]:
`[612, 232, 212, 160]`
In [11]:
```dic['totalcost']#total cost function
```
Out[11]:
`1216`
In [14]:
```cldata = dic['data']#lets see how the clusterd data looks like
```

### The first five observations of each class label

In [26]:
```cldata.ix[cldata['label']==0,:].head()
```
Out[26]:
1vhighvhigh22smallmed0
2vhighvhigh22smallhigh0
4vhighvhigh22medmed0
5vhighvhigh22medhigh0
7vhighvhigh22bigmed0
In [27]:
```cldata.ix[cldata['label']==1,:].head()
```
Out[27]:
0vhighvhigh22smalllow1
3vhighvhigh22medlow1
6vhighvhigh22biglow1
9vhighvhigh24smalllow1
12vhighvhigh24medlow1
In [28]:
```cldata.ix[cldata['label']==2,:].head()
```
Out[28]:
30vhighvhigh32medlow2
32vhighvhigh32medhigh2
33vhighvhigh32biglow2
35vhighvhigh32bighigh2
39vhighvhigh34medlow2
In [29]:
```cldata.ix[cldata['label']==3,:].head()
```
Out[29]: