Paul Penman

Binary Logistic Regression with R – a tutorial

In a previous tutorial, we discussed the concept and application of binary logistic regression. We’ll now learn more about binary logistic regression model building and its assessment using R.

First, we’ll recap our earlier case study and then develop a binary logistic regression model in R, followed by an explanation of model sensitivity and specificity and how to estimate them. You can download the data files for this tutorial here.

Binary Logistic Regression Data Snapshot
Let’s consider the example of loan disbursement. Here’s a snapshot of the data. A bank wants to develop a model that predicts defaulters in order to help its loan disbursal decision making. The dependent variable is the status observed after the loan is disbursed, which is 1 if a customer is a defaulter and 0 otherwise. Age group, years at current address, years at current employer, debt-to-income ratio, credit card debt and other debt are our independent variables.

Data Snapshot

Binary Logistic Regression in R
First we import our data and check its structure in R. As usual, we use the read.csv function to import the data and the str function to check the data structure. Age is a categorical variable and therefore needs to be converted into a factor variable; we use the factor function to convert an integer variable to a factor, as sketched after the import code below.

Import the data and check the data structure before running the model
data<-read.csv("BANK LOAN.csv",header=TRUE)
str(data) 
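Since age is read in as an integer, a minimal conversion sketch (assuming the column is named AGE, as it is in the model call further below) is:

# Convert the integer AGE column into a factor so glm treats it as categorical
data$AGE<-factor(data$AGE)
str(data$AGE)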

Output

Data Frame Data Check

Logistic Regression in R
Logistic regression is a type of generalized linear regression, and therefore the function name is glm. We set the argument family = binomial to specify the model as a binary logistic regression. As in the linear regression model, the dependent and independent variables are separated by the tilde sign, and the independent variables are separated by the plus sign.

Using the glm function to develop a binary logistic regression model
 riskmodel<-glm(DEFAULTER~AGE+EMPLOY+ADDRESS+DEBTINC+CREDDEBT+OTHDEBT,
                family=binomial,data=data) 
glm stands for Generalized Linear Model; logistic regression is a type of GLM.
The LHS of ~ is the dependent variable; the independent variables on the RHS are separated by ‘+’.
riskmodel is the model object.
Setting family = binomial makes glm() fit a logistic regression model.

Individual Hypothesis Testing in R
Which independent variables have an impact on the customer turning into a defaulter?
After fitting the logistic regression model, we can carry out individual hypothesis testing to identify significant variables. We simply use the summary function on the model object to get detailed output. Variables whose p-value is less than 0.05 are considered statistically significant. Since the p-value is < 0.05 for EMPLOY, ADDRESS, DEBTINC and CREDDEBT, these independent variables are significant.

Individual Testing
summary(riskmodel) 
summary() function gives the output of glm

Output


Individual Testing in R
Once we obtain our coefficients, we check their signs against business logic. If a coefficient’s sign does not match the business logic, that variable should be reconsidered for inclusion in the model.

Re-run Model in R
Next, we re-run the binary logistic regression model by including only significant variables. The output of the summary function provides revised estimates of the model parameters.
riskmodel<-glm(DEFAULTER~EMPLOY+ADDRESS+DEBTINC+CREDDEBT,
                family=binomial,data=data)
 summary(riskmodel) 
In this output, all independent variables are statistically significant and the signs are logical. This model can therefore be used for further diagnostics.

Output
Binary logistic regression model re-run

Final Model
This is the final model after substituting the values of parameter estimates. The probability of defaulting can be predicted when the values of the X variables are entered into this equation.
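Written out in its general form (the b values are the parameter estimates from the summary output above), the fitted model is:

P(DEFAULTER = 1) = exp(z) / (1 + exp(z)), where z = b0 + b1*EMPLOY + b2*ADDRESS + b3*DEBTINC + b4*CREDDEBT

Entering a customer’s values of EMPLOY, ADDRESS, DEBTINC and CREDDEBT into this equation gives that customer’s predicted probability of default.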

Final Binary logistic regression model

Odds Ratio in R
As discussed previously, we use the odds ratio to measure the association between the independent variables and the dependent variable. In R, we obtain the model coefficients using the coef function and estimate the odds ratios by taking the antilog, i.e. exponentiating the coefficients. The confint function calculates confidence intervals for the coefficients; applying the exponential function to these gives confidence intervals for the odds ratios. Having calculated these, we then combine them with the model coefficients using the cbind function.
coef(riskmodel)
 exp(coef(riskmodel)) 
 exp(confint(riskmodel))
 cbind(coef(riskmodel),odds_ratio=exp(coef(riskmodel)),exp(confint(riskmodel))) 
coef(riskmodel): returns the model coefficients.

exp(coef(riskmodel)): computes the odds ratios.

exp(confint(riskmodel)): calculates confidence intervals for the odds ratios.

From the output, we see that none of the confidence intervals for the odds ratios includes one, which indicates that all the variables included in the model are significant. The odds ratio for CREDDEBT is approximately 1.77.
For a one-unit increase in CREDDEBT, the odds of being a defaulter change by a factor of about 1.77.
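Because odds ratios multiply, a two-unit increase in CREDDEBT multiplies the odds by roughly 1.77^2, i.e. about 3.13. This can be checked directly from the fitted model in R:

exp(2*coef(riskmodel)["CREDDEBT"])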

Output

Odds ratio in R

Predicting Probabilities in R
We obtain the predicted probabilities from the final model using the fitted function. The round function rounds the probabilities to two decimal places. The predicted probabilities are saved in the same dataset, ‘data’, in a new variable, ‘predprob’.

Predicting Probabilities
 data$predprob<-round(fitted(riskmodel),2)
 head(data,n=10)  

fitted function generates the predicted probabilities based on the final riskmodel.

the round function rounds the probabilities to 2 decimals.

data$predprob: Predicted probabilities are saved in the same dataset ‘data’ in a new variable ‘predprob’.
This is the data with the predicted probabilities. The last column now contains the predicted probabilities from the final model.

Output
Probability prediction table
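As a side note, fitted() only returns probabilities for the data used to build the model. If probabilities are needed for new applicants, the predict function with type = "response" gives the same quantity; a minimal sketch, assuming a hypothetical data frame newapplicants containing the same independent variables:

# Predicted default probabilities for new observations (newapplicants is hypothetical)
round(predict(riskmodel,newdata=newapplicants,type="response"),2)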

Classification Table
It is important to measure the goodness of fit of any fitted model. Based on a chosen cut-off value of probability, the dependent variable Y is predicted to be either one or zero. A cross-tabulation of the observed values of Y against the predicted values of Y is known as a classification table. Since this classification table varies with the cut-off value, it is not considered a good measure of goodness of fit unless an optimum cut-off is obtained.

The accuracy percentage measures how well the model predicts the observed outcomes. In the table, the dependent variable was both observed and predicted to be zero 479 times, and both observed and predicted to be one 92 times. The accuracy rate is therefore calculated as 479 plus 92, divided by the total sample size of 700, giving 81.57%.

Observed and expected values

Misclassification
Next, let’s look at the misclassification rate. The misclassification rate is the percentage of wrongly predicted observations. In this example, it is obtained as 38 plus 91, divided by 700, giving a misclassification rate of 18.43%.
Misclassification rate
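As a quick check of this arithmetic in R:

# Accuracy and misclassification rate from the counts quoted above
(479+92)/700*100   # 81.57
(38+91)/700*100    # 18.43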

Classification Table Terminology
Several terms are used to describe the cells of the classification table: sensitivity, specificity, false positive rate and false negative rate. The sensitivity of a model is the percentage of correctly predicted occurrences, or events. It is the probability that the predicted value of Y is one, given that the observed value of Y is one.

In contrast, specificity is the percentage of non-occurrences that are correctly predicted: that is, the probability that the predicted value of Y is zero, given that the observed value of Y is also zero. The false positive rate is the percentage of non-occurrences that are wrongly predicted as events. Similarly, the false negative rate is the percentage of occurrences that are predicted incorrectly.
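Writing TP for correctly predicted defaulters, TN for correctly predicted non-defaulters, FP for non-defaulters wrongly predicted as defaulters and FN for defaulters wrongly predicted as non-defaulters, these definitions can be written as:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
False positive rate = FP / (FP + TN)
False negative rate = FN / (FN + TP)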

Classification table
Sensitivity and Specificity calculations
This table shows the accuracy, sensitivity and specificity values for different cut-off values. On the basis of these values, we can deduce that a cut-off of 0.3 is the best choice for this model. A sketch of how such a table can be generated in R follows below.

Sensitivity and Specificity
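A minimal sketch of how such a table could be produced in R, assuming the predicted probabilities are already stored in data$predprob:

# Accuracy, sensitivity and specificity for a range of cut-off values
cutoffs<-seq(0.1,0.9,by=0.1)
perf<-t(sapply(cutoffs,function(k){
  tab<-table(factor(data$DEFAULTER,levels=c(0,1)),
             factor(as.numeric(data$predprob>k),levels=c(0,1)))
  c(cutoff=k,
    accuracy=sum(diag(tab))/sum(tab)*100,
    sensitivity=tab[2,2]/sum(tab[2,])*100,
    specificity=tab[1,1]/sum(tab[1,])*100)
}))
perf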

Classification and Sensitivity and Specificity table in R
Let us obtain our classification table in R. We use the table function to create a cross table of the observed and predicted values of the dependent variable. Here, TRUE indicates predicted defaulters, whereas FALSE indicates predicted non-defaulters. There are 479 correctly predicted non-defaulters and 92 correctly predicted defaulters, whereas 38 non-defaulters are wrongly predicted as defaulters and 91 defaulters are wrongly predicted as non-defaulters.
# Classification Table
classificationtable<-table(data$DEFAULTER,data$predprob > 0.5)
classificationtable
table function will create a cross table of observed Y (defaulter) vs. predicted Y (predprob).

Output
Interpretation

Sensitivity and Specificity in R
Let's now calculate the sensitivity and specificity values in R, using the formulas discussed above. On calculation, the sensitivity of the model is 50.3%, whereas the specificity is 92.7%. The sensitivity value is definitely lower than desired.
# Sensitivity and Specificity
sensitivity<-(classificationtable[2,2]/(classificationtable[2,2]+classificationtable[2,1]))*100
 sensitivity
 specificity<-(classificationtable[1,1]/(classificationtable[1,1]+classificationtable[1,2]))*100
 specificity 

Output
sensitivity 
 [1] 50.27322
 specificity 
 [1] 92.6499 

Interpretation
The sensitivity is 50.3% and the specificity is 92.7% when the cut-off is set at 0.5.
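To see the effect of the 0.3 cut-off suggested by the table earlier, the same calculation can be repeated with the new threshold; a minimal sketch:

# Classification table, sensitivity and specificity at the 0.3 cut-off
classificationtable03<-table(data$DEFAULTER,data$predprob > 0.3)
sensitivity03<-(classificationtable03[2,2]/(classificationtable03[2,2]+classificationtable03[2,1]))*100
specificity03<-(classificationtable03[1,1]/(classificationtable03[1,1]+classificationtable03[1,2]))*100
sensitivity03
specificity03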

Quick Recap
 In this tutorial, we explained how to perform binary logistic regression in R. Model performance is assessed using sensitivity and specificity values. Sensitivity is the percentage of events correctly predicted, whereas specificity is the percentage of non-events correctly predicted.