**Introduction**

In my last post, “Understanding mathematics behind Logistic Regression”, I explained the basic maths behind logistic regression. In this post, I will implement a logistic regression model in R using the Titanic dataset, where the target variable is ‘Survived’ and takes the two values 0 and 1.

**Data Dictionary**

Variable | Definition | Key |
---|---|---|
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Sex | Sex | |
Age | Age in years | |
SibSp | No. of siblings / spouses aboard the Titanic | |
Parch | No. of parents / children aboard the Titanic | |
Ticket | Ticket number | |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

I will also explain how to interpret the output from the R code. So let’s start.

**R Code**

**Loading the dataset**

The ‘titanic’ package in R provides both the training and test datasets, so we will use this package to load the data directly.

```r
## Load the titanic library to get the dataset
library(titanic)

## Load the datasets
data("titanic_train")
data("titanic_test")
```

**Data details and missing values imputation**

```r
## Setting Survived column for test data to NA
titanic_test$Survived <- NA

## Combining training and testing datasets
complete_data <- rbind(titanic_train, titanic_test)

## Check data structure
str(complete_data)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

## Let's check for any missing (NA) values in the data
colSums(is.na(complete_data))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0         418           0           0           0         263
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           1           0           0

## Checking for empty values
colSums(complete_data == '')
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0          NA           0           0           0          NA
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0          NA        1014           2

## Check the number of unique values in each column to find columns
## we can convert to factors
sapply(complete_data, function(x) length(unique(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age
##        1309           3           3        1307           2          99
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           7           8         929         282         187           4

## Missing value imputation
complete_data$Embarked[complete_data$Embarked == ""] <- "S"
complete_data$Age[is.na(complete_data$Age)] <- median(complete_data$Age, na.rm = TRUE)

## Removing Cabin as it has a very high share of missing values;
## PassengerId, Ticket and Name are not required
library(dplyr)
titanic_data <- complete_data %>% select(-c(Cabin, PassengerId, Ticket, Name))
```
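As a quick sanity check (a small sketch, not part of the original code), we can confirm what is still missing after imputation. Note that one ‘Fare’ value remains NA; it sits in the unlabelled test portion of the data, so it does not affect the model fit below.

```r
## Remaining missing values: Survived is NA for the 418 unlabelled test
## rows, and a single Fare value (also in the test portion) is still NA
colSums(is.na(titanic_data))
```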

**Dummy variable creation**

For logistic regression, it is important to create dummy variables for all the categorical variables so that every category is taken into account while building the model. Dummy variables are simply indicator (0/1) variables, one for each category of a categorical variable. For example, our dataset has a variable ‘Pclass’, so creating dummy variables for it adds three new variables to the dataset, one per class (Pclass_1, Pclass_2, Pclass_3). I’ll first convert all the categorical variables to factors and then use the ‘dummies’ package in R to create the dummy variables.

## Converting "Survived","Pclass","Sex","Embarked" to factors for (i in c("Survived","Pclass","Sex","Embarked")){ titanic_data[,i]=as.factor(titanic_data[,i]) } ## Create dummy variables for categorical variables library(dummies) titanic_data <- dummy.data.frame(titanic_data, names=c("Pclass","Sex","Embarked"), sep="_")

**Data Splitting**

Next, we split the dataset into training and testing sets. Only the original training portion of the data has values for the target variable, so I’ll carve both sets out of it. There are 891 labelled observations, and I’ll split them in roughly a 75:25 ratio: rows 1-667 for training and rows 668-889 for testing (note that the code below leaves out the last two labelled rows). This split lets us measure the model’s accuracy. We can then use ‘predict()’ again to predict the target variable of the unlabelled test dataset.

```r
## Splitting training and test data
train <- titanic_data[1:667, ]
test  <- titanic_data[668:889, ]
```
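If a randomized split is preferred over the fixed index split above, something like the following sketch would work (the seed is arbitrary; the exact figures in the outputs below come from the fixed split).

```r
## Randomized ~75:25 split of the 891 labelled rows
set.seed(42)
train_idx <- sample(seq_len(891), size = floor(0.75 * 891))
train <- titanic_data[train_idx, ]
test  <- titanic_data[setdiff(seq_len(891), train_idx), ]
```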

**Model Creation**

```r
## Model creation
model <- glm(Survived ~ ., family = binomial(link = 'logit'), data = train)

## Model summary
summary(model)
##
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"),
##     data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.3804  -0.6562  -0.4300   0.6392   2.3950
##
## Coefficients: (3 not defined because of singularities)
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.373105   0.319779  -4.294 1.76e-05 ***
## Pclass_1     2.175104   0.359365   6.053 1.42e-09 ***
## Pclass_2     1.302268   0.271680   4.793 1.64e-06 ***
## Pclass_3           NA         NA      NA       NA
## Sex_female   2.677814   0.226863  11.804  < 2e-16 ***
## Sex_male           NA         NA      NA       NA
## Age         -0.031671   0.008945  -3.540 0.000399 ***
## SibSp       -0.248975   0.123365  -2.018 0.043570 *
## Parch       -0.091603   0.141950  -0.645 0.518718
## Fare        -0.001397   0.003179  -0.440 0.660254
## Embarked_C   0.431447   0.271693   1.588 0.112288
## Embarked_Q   0.533193   0.369337   1.444 0.148837
## Embarked_S         NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 891.99  on 666  degrees of freedom
## Residual deviance: 605.78  on 657  degrees of freedom
## AIC: 625.78
##
## Number of Fisher Scoring iterations: 5
```

**Interpreting Model Results**

From the results we can see that ‘Parch’, ‘Fare’ and ‘Embarked’ are not statistically significant, and ‘SibSp’ is only marginally significant (p ≈ 0.044). Among the significant variables, ‘Sex_female’ has the lowest p-value, which means that the passenger’s gender has a strong association with the target variable. Also note the NA coefficients for Pclass_3, Sex_male and Embarked_S: since we created a dummy column for every level of each factor, one level per factor is perfectly collinear with the rest (plus the intercept), so glm() drops it and that level serves as the reference category. The coefficient of ‘Sex_female’ is positive, which tells us that, with all other variables held constant, females had a higher probability of surviving. Being female increases the log-odds of survival by 2.68, i.e. it multiplies the odds of survival by e^2.68 ≈ 14.6, while the negative coefficient for age tells us that each additional year of age reduces the log-odds of survival by 0.032.
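To make the coefficients easier to read, we can exponentiate them to convert log-odds into odds ratios (a quick sketch using the fitted model from above; this step is not part of the original output).

```r
## Exponentiate log-odds coefficients to get odds ratios: values above 1
## increase the odds of survival, values below 1 decrease them
exp(coef(model))
```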

**Testing variable importance using Anova**

```r
## Using anova() to analyze the table of deviance
anova(model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Survived
##
## Terms added sequentially (first to last)
##
##
##            Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                         666     891.99
## Pclass_1    1   39.603       665     852.39 3.112e-10 ***
## Pclass_2    1   26.485       664     825.91 2.655e-07 ***
## Pclass_3    0    0.000       664     825.91
## Sex_female  1  197.978       663     627.93 < 2.2e-16 ***
## Sex_male    0    0.000       663     627.93
## Age         1    8.986       662     618.94  0.002721 **
## SibSp       1    8.114       661     610.83  0.004393 **
## Parch       1    0.998       660     609.83  0.317889
## Fare        1    0.044       659     609.79  0.834588
## Embarked_C  1    1.936       658     607.85  0.164139
## Embarked_Q  1    2.067       657     605.78  0.150485
## Embarked_S  0    0.000       657     605.78
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We use the anova() function to analyze the table of deviance. The output above shows the effect of adding each variable to the model sequentially, starting from the null model; the drop in residual deviance at each step measures that effect. We can see that adding ‘Pclass’, ‘Sex’, ‘Age’ and, to a lesser extent, ‘SibSp’ reduces the residual deviance significantly, whereas the other variables do not. The high p-values for ‘Parch’, ‘Fare’ and ‘Embarked’ again indicate that these variables are not statistically significant.
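As a rough single-number summary of the overall deviance reduction (a small sketch, not part of the original analysis), we can compute McFadden’s pseudo R-squared.

```r
## McFadden's pseudo R-squared: proportional reduction in deviance
## relative to the null model, here roughly 1 - 605.78/891.99 = 0.32
1 - model$deviance / model$null.deviance
```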

**Predicting target variable and confusion matrix statistics**

```r
## Predicting test data
result <- predict(model, newdata = test, type = 'response')
result <- ifelse(result > 0.5, 1, 0)

## Confusion matrix and statistics
## (recent versions of caret require both arguments to be factors)
library(caret)
confusionMatrix(data = factor(result, levels = c(0, 1)),
                reference = test$Survived)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 128  25
##          1  13  56
##
##                Accuracy : 0.8288
##                  95% CI : (0.7727, 0.8759)
##     No Information Rate : 0.6351
##     P-Value [Acc > NIR] : 1.817e-10
##
##                   Kappa : 0.6187
##  Mcnemar's Test P-Value : 0.07435
##
##             Sensitivity : 0.9078
##             Specificity : 0.6914
##          Pos Pred Value : 0.8366
##          Neg Pred Value : 0.8116
##              Prevalence : 0.6351
##          Detection Rate : 0.5766
##    Detection Prevalence : 0.6892
##       Balanced Accuracy : 0.7996
##
##        'Positive' Class : 0
##
```

From the confusion matrix we can see that 25 passengers who actually survived (reference 1) were predicted as 0 by the model, and conversely 13 passengers who did not survive were predicted as 1. The model accuracy is 82.88%, which is fairly good. Note that caret has taken 0 as the ‘positive’ class here, so the sensitivity and specificity figures refer to predicting non-survival.
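As a quick sanity check (a sketch using the result and test objects created above), the accuracy can also be computed directly from the predictions.

```r
## Cross-tabulate predictions against actual values; the share of correct
## predictions is (128 + 56) / 222, i.e. about 0.8288
table(Predicted = result, Actual = test$Survived)
mean(result == as.numeric(as.character(test$Survived)))
```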

**ROC (Receiver Operating Characteristic) Curve**

```r
## ROC curve and calculating the area under the curve (AUC)
library(ROCR)
predictions <- predict(model, newdata = test, type = "response")
ROCRpred <- prediction(predictions, test$Survived)
ROCRperf <- performance(ROCRpred, measure = "tpr", x.measure = "fpr")
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2, 1.7),
     print.cutoffs.at = seq(0, 1, 0.1))
```

The ROC curve is used for calculating the AUC (area under the curve), which measures the performance of a binary classifier. The ROC curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, and the AUC is the area under that curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.
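To make the TPR/FPR definitions concrete before computing the AUC, here is a small hand computation (a sketch using the counts from the confusion matrix above, with survival taken as the positive class).

```r
## One point on the ROC curve: TPR and FPR at the 0.5 cutoff
tp <- 56   # survivors correctly predicted as 1
fn <- 25   # survivors wrongly predicted as 0
fp <- 13   # non-survivors wrongly predicted as 1
tn <- 128  # non-survivors correctly predicted as 0
c(TPR = tp / (tp + fn), FPR = fp / (fp + tn))
##       TPR       FPR
## 0.6913580 0.0921986
```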

```r
auc <- performance(ROCRpred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8714211
```

I hope this post helps you understand the basics of logistic regression and how to implement it in R.