### Introduction to Softmax Regression

We have seen many classification algorithms for binary classification. Now we will look at a technique that classifies among k classes: softmax regression.

Softmax regression is also called multinomial logistic regression, and it is a generalization of logistic regression. Softmax regression models a categorical dependent variable whose categories have no natural order (or rank). For dependent variables that do have an order, we need ordinal logistic regression, which we will discuss some other time. For example, rather than classifying emails as spam or ham, we may want to classify them as spam, work-related, or personal. The softmax function arises naturally from the exponential family of distributions.

#### Deep dive

The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

$$p(\mathbf{x}\mid\boldsymbol{\eta}) = h(\mathbf{x})\, g(\boldsymbol{\eta}) \exp\{\boldsymbol{\eta}^{T}\mathbf{u}(\mathbf{x})\}$$

where x may be scalar or vector, and may be discrete or continuous. Here η are called the natural parameters of the distribution, and u(x) is some function of x.

The function g(η) can be interpreted as the coefficient that ensures that the distribution is normalized.
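Written out, the normalization condition that g(η) enforces is:

$$g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\{\boldsymbol{\eta}^{T}\mathbf{u}(\mathbf{x})\}\, d\mathbf{x} = 1$$

with the integral replaced by a sum when x is discrete.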

The multinomial distribution for a single observation x takes the form

$$p(\mathbf{x}\mid\boldsymbol{\mu}) = \prod_{k=1}^{M} \mu_k^{x_k}$$

where x = (x₁, …, x_M)ᵀ is a one-hot encoding of the class label and μ = (μ₁, …, μ_M)ᵀ are the class probabilities.

In the standard exponential-family representation,

$$p(\mathbf{x}\mid\boldsymbol{\eta}) = \exp(\boldsymbol{\eta}^{T}\mathbf{x})$$

with natural parameters η_k = ln μ_k; comparing with the general form gives u(x) = x, h(x) = 1 and g(η) = 1.

Performing a few algebraic manipulations (removing the constraint $\sum_k \mu_k = 1$ by redefining $\eta_k = \ln(\mu_k/\mu_M)$), we reach the final equation:

$$\mu_k = \frac{\exp(\eta_k)}{1 + \sum_{j=1}^{M-1}\exp(\eta_j)}$$

This is called the softmax function, or the normalized exponential. The multinomial distribution therefore takes the form:

$$p(\mathbf{x}\mid\boldsymbol{\eta}) = \left(1 + \sum_{k=1}^{M-1}\exp(\eta_k)\right)^{-1} \exp(\boldsymbol{\eta}^{T}\mathbf{x})$$

where M is the number of classes (so M − 1 is the number of free parameters).
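The softmax mapping itself is easy to compute directly. A minimal sketch in R (the `softmax` function here is our own illustration, not from a package; it uses the equivalent symmetric form μ_k = exp(η_k) / Σ_j exp(η_j)):

```r
# Softmax: map a vector of real-valued scores to class probabilities
softmax <- function(eta) {
  eta <- eta - max(eta)        # subtract the max for numerical stability
  exp(eta) / sum(exp(eta))
}

scores <- c(2.0, 1.0, 0.1)     # hypothetical scores for 3 classes
probs  <- softmax(scores)
round(probs, 3)                # probabilities sum to 1; class 1 gets ~0.66
```

Note the max-subtraction trick: it leaves the result unchanged but prevents overflow when scores are large.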

This model, when applied to classification problems where there are multiple class labels, is called **Softmax regression**.

#### Implementation in R:

###### Datafile:

The data file used in the post for the explanation can be downloaded using this link.

###### R Code and interpretation:

#Data Structure

str(read_data)

read_data$NSPF <- as.factor(read_data$NSP)

#1-Normal Medical Condition #2-Suspect #3-Pathologic

This converts the class label NSP into a factor, stored as NSPF.

read_data$out <- relevel(read_data$NSPF,ref = "1")

Since our class label NSP/NSPF has 3 levels, we must choose a reference level. Here we use 1 as the reference level (i.e., the Normal condition).
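As a minimal, self-contained sketch of the factor conversion and releveling (with a synthetic data frame standing in for the post's data file, using the same 1/2/3 coding):

```r
# Synthetic stand-in: NSP coded 1 (Normal), 2 (Suspect), 3 (Pathologic)
df <- data.frame(NSP = c(1, 2, 3, 1, 2, 3, 1, 1, 2, 3))
df$NSPF <- as.factor(df$NSP)            # numeric codes become factor levels
df$out  <- relevel(df$NSPF, ref = "1")  # make level "1" (Normal) the baseline
levels(df$out)                          # "1" "2" "3"
```

`relevel()` only reorders the levels; multinom() will then report coefficients for levels "2" and "3" relative to the baseline "1".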

#Multinomial Logistic regression

library(nnet)


model <- multinom(out ~ LB + AC + FM + UC + DL + DS + DP + ASTV + MSTV + ALTV + MLTV, data = read_data)

We can see that the optimizer starts with a very high error of 2335.64, drops to 1123.04 after 10 iterations, and finally converges at an error of 608.05.

summary(model)

We can see the coefficients for levels 2 and 3, since level 1 was chosen as the reference. The residual deviance measures the error remaining in the model, similar to the sum of squared errors in linear regression: the lower the value, the better the model. We also get standard errors for both levels 2 and 3.

predict(model,read_data)

The model predicts the 1st patient to be Pathologic, the 2nd patient to be Normal, and so on. But comparing with the actual output, we see that the 1st patient is actually a Suspect while our model classifies them as Pathologic, so there is some misclassification.

predict(model,read_data,type="prob")

What we see is that for every patient there are 3 probabilities, one per class. For example, for the 1st patient, level 3 has the maximum probability, so the model predicts the 1st patient to be Pathologic.
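Picking the most probable class per row can be sketched as follows (the probability matrix here is hypothetical, standing in for the output of `predict(model, read_data, type = "prob")`):

```r
# Hypothetical probability matrix: rows = patients, columns = levels 1..3
probs <- matrix(c(0.10, 0.25, 0.65,
                  0.80, 0.15, 0.05),
                nrow = 2, byrow = TRUE)
pred_class <- apply(probs, 1, which.max)  # most probable level per patient
pred_class                                # 3 for patient 1, 1 for patient 2
```

This is exactly what `predict(model, ...)` without `type = "prob"` does internally: it reports the level with the highest probability.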

Let’s say we are interested in the predictions for patients 456, 567 and 667.

predict(model,read_data[c(456,567,667),],type="prob")

misclassification=table(predict(model),read_data$NSPF)

Now we have the confusion matrix, and we can calculate accuracy by summing the diagonal elements and dividing by the total number of observations. This gives a model accuracy of 88.24%.
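The accuracy computation can be sketched with a small, self-contained example (the predicted/actual labels below are hypothetical, standing in for `predict(model)` and `read_data$NSPF`):

```r
# Hypothetical predicted vs. actual class labels for 8 observations
predicted <- factor(c(1, 1, 2, 3, 3, 2, 1, 3))
actual    <- factor(c(1, 2, 2, 3, 3, 2, 1, 1))

tab <- table(predicted, actual)         # confusion matrix
accuracy <- sum(diag(tab)) / sum(tab)   # correct predictions / total
accuracy                                # 0.75 here (6 of 8 correct)
```

`diag(tab)` picks out the counts where the predicted and actual labels agree, i.e. the diagonal of the confusion matrix.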

#2-tailed z test

z <- summary(model)$coefficients/summary(model)$standard.errors

p <- (1 - pnorm(abs(z), 0, 1)) * 2

This gives us the p-value for every attribute used in the model. The lower the p-value, the stronger the evidence that the attribute plays a significant role in the classification. The attribute MSTV for level 2 has a very high p-value of 0.93, which implies that MSTV does not contribute significantly to the model when level 1 is the reference and we are looking at level 2 for the response. The same reasoning can be applied to all the other attributes.

Removing the insignificant variable MSTV and refitting the model increases the accuracy to 88.47%.
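The refit can be sketched with `update()`, which drops a term from a fitted model's formula and re-estimates it. Since the post's data file is not bundled here, this sketch uses R's built-in iris data; on the post's model the equivalent call would be `update(model, . ~ . - MSTV)`:

```r
library(nnet)  # multinom()

# Fit a full multinomial model on iris, then drop one predictor and refit
full    <- multinom(Species ~ ., data = iris, trace = FALSE)
reduced <- update(full, . ~ . - Sepal.Width)   # remove one term, re-estimate

ncol(coef(reduced)) < ncol(coef(full))         # one fewer coefficient column
```

Comparing the two models' accuracy (or residual deviance / AIC) then tells you whether the dropped variable was pulling its weight.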