You can download the data files for this tutorial here.

We’ll first recap a few aspects of binary logistic regression and then focus on statistical modeling, hypothesis testing and classification tables using Python. We’ll use a case study in the banking domain to demonstrate the method.

__Binary Logistic Regression in Python__Binary logistic regression models the relationship between a set of independent variables and a binary dependent variable. It is useful when the dependent variable is dichotomous in nature, such as death or survival, absence or presence, pass or fail, for example. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P (Y=1) as a function of X. Independent variables can be categorical or continuous, for example, gender, age, income, geographical region and so on. Binary logistic regression models a dependent variable as a logit of p, where p is the probability that dependent variables take a value of ‘one'.

Statistical Model – For k PredictorsStatistical Model – For k Predictors

So what does the statistical model in binary logistic regression look like? In this equation, p is the probability that Y equals one given X, where Y is the dependent variable and X’s are independent variables. B 0 to B K are the parameters of the model. These parameters of the model are estimated using the maximum likelihood method. The left-hand side of the equation ranges between minus infinity to plus infinity.

where,

where,

p : Probability that Y=1 given X

Y : Dependent Variable

X1, X2 ,…, Xk : Independent Variables

b0, b1 ,…, bk : Parameters of Model

Case Study – Modeling Loan DefaultsCase Study – Modeling Loan Defaults

Let’s explain the concept of binary logistic regression using a case study from the banking sector. Our bank has the demographic and transactional data of its loan customers. It wants to develop a model that predicts defaulters and help the bank in its loan disbursal decision making. The objective here is to predict whether customers applying for a loan will be defaulters or not. We will use a sample of size 700 to develop the model. The independent variables are age group, years at current address, years at current employer, debt to income ratio, credit card debt and other debt. All of these variables are collected at the time of the loan application process and will be used as independent variables. The dependent variable is the status observed after the loan is disbursed, which will be one if it is a defaulter and zero if not.

BLR Data SnapshotBLR Data Snapshot

Here’s a snapshot of the data. Our dependent variable is binary, whereas the independent variables are either categorical or continuous in nature.

Binary Logistic Regression in PythonBinary Logistic Regression in Python

Let’s import our data and check the data structure in Python. As usual, we import the data using

**read_csv**function in the pandas library, and use the**info**function to check the data structure. We can see here that the Age variable is an integer type.# Import data and check data structure before running model