Paul Penman

Binary Logistic Regression – An introduction

In this tutorial we’ll learn about binary logistic regression and its application to real life data. Without any doubt, Binary Logistic Regression remains the most widely used predictive modelling method.

You can download the data files for this tutorial here.

Firstly, let's discuss what binary logistic regression is and its applications in different sectors. We’ll then focus on statistical models and hypothesis testing and finally we will use a case study from the banking domain to strengthen our understanding.

Binary logistic regression models the relationship between a set of independent variables and a binary dependent variable. It is useful when the dependent variable is dichotomous in nature, like death or survival, absence or presence and so on. Independent variables can be categorical or continuous, for example, gender, age, income or geographical region. Binary logistic regression models a dependent variable as a logit of p, where p is the probability that the dependent variables take a value of 1.



Binary logistic regression models are used across many domains and sectors. For example, it can be used in marketing analytics to identify potential buyers of a product, in human resources management to identify employees who are likely to leave a company, in risk management to predict loan defaulters, or in insurance, where the objective is to predict policy lapse. All of these objectives are based on information such as age, gender, occupation, premium amount, purchase frequency, and so on. In all these objectives, the dependent variable is binary, whereas independent variables are categorical or continuous.



Why can’t we use linear regression for binary dependent variables? One reason is that the distribution of Y is random and not normal, as in the case of linear regression. Also, the left-hand side and right hand of the model will not be comparable if we use linear regression for a binary dependent variable.



Linear regression is suitable for predicting a continuous value such as the price of property based on area in square feet. In such a case the regression line is a straight line. Logistic regression on the other hand is used for classification problems, which predict a probability that a dependent variable Y takes a value of 1, given the values of predictors. In binary logistic regression, the regression curve is a sigmoid curve.



Statistical Model for Binary Logistic Regression

The equation below is a statistical model for binary logistic regression with a single predictor. The small p is the probability that the dependent variable ‘Y’ will take the value 1, given the value of ‘X’, where X is the independent variable. The natural log of “p divided by one minus p” is called the logit or link function. The right-hand side is a linear function of X, very much similar to a linear regression model.


The general logistic regression model for a single predictor can be extended to a model with k predictors and is represented as given here. In this equation, p is the probability that Y equals one given X, where Y is the dependent variable and X’s are independent variables. B0 to bk are the parameters of the model, they are estimated using the maximum likelihood method, which we’ll discuss shortly. The left-hand side of the equation ranges between minus infinity to plus infinity.

Note that LHS of the model can lie between – ∞ to
Binary Logistic Regression Case Study
Let’s now look at the concept of binary logistic regression using a banking case study. We have a bank that has the demographic and transactional data of its loan customers. The bank wants to develop a model that predicts defaulters and can help the bank in loan disbursal decision making. The objective here is to predict whether customers applying for a loan will default or not. We use a sample of size 700 to develop the model. The independent variables are age group, years at current address, years at current employer, debt to income ratio, credit card debts and other debts. All of these variables are collected at the time of the loan application process and will be used as independent variables. The dependent variable is the status observed after the loan is disbursed, which will be 1 if it is a defaulter, and 0 if not.


Here’s a snapshot of our data. Our dependent variable is binary, whereas the independent variables are either categorical or continuous in nature.



Before moving to modelling, it’s always useful to perform exploratory analysis. Let’s do some bivariate analysis. Table 1 presents the relationship between defaulter status and different age groups. Table 2 shows the transactional behaviour for defaulters and non-defaulters. The average credit card liability of defaulters is 2.42 vs. 1.25 for non-defaulters. Also, average debt to income for defaulters is very high as compared to non-defaulters.



This is how the binary logistic regression model will look for our case study. Here, p is the probability that the customer will be a defaulter. The parameter b is the intercept and b1 b2 etc are coefficients of other independent variables. For the purposes of understanding, we have included all independent variables in the model. However, the final model will be presented using only significant independent variables.



As mentioned before, the parameters of the logistic regression model are estimated using maximum likelihood estimation. The likelihood function is a joint probability of Y1, Y2….up to Yn. The parameters are estimated by maximising the likelihood function, L. Two commonly used iterative algorithms are the Fisher scoring method and the Newton-Raphson method. Both of these algorithms give the same parameter estimates with a slight difference in the estimated covariance matrix. From an application point of view, we don’t need to worry about complex mathematics. The algorithm is available as a built-in function in R and Python.



The table here gives all parameter estimates that can be used to write the model equation. The model equation can be used to estimate probability of default by substituting values of specific customer characteristics. Note that AGE is a categorical variable with 3 categories and hence two coefficient values are shown for two dummy variables.

  



The logit link or logit function is also known as the log of odds, where odds is the probability of success divided by the probability of failure. The estimated parameter gives the change in log of odds given one unit change in the independent variable. For example, the estimated coefficient of employ (that is number of years customer is working at current employer) is -0.26172. This means that one unit change in employ will result in a change of -0.26712 in log of odds. The negative sign implies that a customer with a steady job is less likely to be a loan defaulter. We usually interpret the odds ratio rather than regression coefficients for a logistic regression model. The odds ratio gives a measure of association between the dependent and independent variable.


Let us look at what the odds ratio is. The odds ratio is a measure of association between the independent variable and an outcome. The odds ratio is the ratio of odds of the event occurring given X=1 and X= 0. It represents the factor by which the odds change for a one unit change in the independent variable. Taking the ant-log of the regression coefficient we can get the odds ratio.



The table shows the odds ratio for each independent variable. An odds ratio greater than one indicates a positive association between the dependent and independent variables, whereas an odds ratio less than one indicates a negative relationship between the dependent and independent variables. An odds ratio equal to one indicates no association between the variables. For example, the odds ratio of the employ independent variable is 0.77 indicates that for one unit change in employ, the odds of being a defaulter will change by 0.77 fold or decrease by 23%.


To identify which independent variables are statistically significant and are to be included in the final model, we use the Wald’s test. This test is used for assessing the significance of each independent variable separately. The null hypothesis for Wald’s test is “Specific Parameter is Zero”. It also means that there is no effect of that predictor on the dependent variable. The test statistic given by Z is the ratio of the estimate of the independent variable to the standard error of the estimate of the independent variable. Under the null hypothesis, the test statistic is assumed to follow standard normal distribution. We reject the null hypothesis if the p value is less than 0.05.




Let us do quick recap. In this tutorial, we learned about the binary logistic regression model and its application. We also looked how to estimate parameters of binary logistic regression using the maximum likelihood method and how to interpret the odds ratio.