Paul Penman

Introduction to Multiple Linear Regression – Python

Multiple Linear Regression (MLR) is the backbone of predictive modeling and machine learning, and an in-depth knowledge of MLR is critical in the predictive modeling world.

Previously we discussed implementing multiple linear regression in R; now we’ll look at implementing it in Python.

You can download the data files for this tutorial here.

In this tutorial the focus is on estimating model parameters to fit a model in Python and then interpreting the results. We will use the same case study that we used in the R tutorial earlier to explain the Python code. The statistical concepts were discussed in detail earlier, so we will only summarize the key points here.

Python
Multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables.
Again, the price of a house in US dollars can be the dependent variable and the size of the house, its location, the air quality index in the area, distance from airport and so on, can be independent variables.
The price of a house is our target variable, which we call the DEPENDENT VARIABLE.

Statistical Model for MLR
Our statistical model has two parts –
The left hand side has the dependent variable denoted as Y, and the right hand side has independent variables denoted as X1, X2…up to Xp.
Each independent variable has a specific WEIGHTAGE called a REGRESSION PARAMETER.
The parameter b0 is the intercept in the model.
The parameters of the model are estimated using the LEAST SQUARES METHOD.

In equation form: Y = b0 + b1X1 + b2X2 + … + bpXp + e, where e is the error term.

Multiple Linear Regression Case Study – Modeling Job Performance Index
Let’s illustrate all of these concepts using a case study. The objective is to model the Job Performance Index based on the various TEST scores of newly recruited employees. The dependent variable is Job Performance Index and the independent variables are aptitude, test of language, technical knowledge and general information.

Multiple Linear Regression Dataset Snapshot
Here is a snapshot of the data with our dependent and independent variables.
All variables are numeric in nature. Employee ID is only an identifier, so it is not used as a variable in the model.
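A quick check along these lines can confirm that every column is numeric and set the identifier aside before modeling. This is a minimal sketch: the four rows are made-up stand-ins for the tutorial's CSV, and the column name `empid` is a hypothetical label for the Employee ID column. Only `jpi`, `tol`, `aptitude`, `technical` and `general` come from the model formula used later in this tutorial.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical rows standing in for "Performance Index.csv"; values are made up.
perindex_demo = pd.DataFrame({
    "empid": [1, 2, 3, 4],  # assumed name for the Employee ID column
    "jpi": [45.5, 40.8, 51.2, 38.9],
    "aptitude": [43.8, 32.7, 55.6, 30.1],
    "tol": [55.9, 51.7, 62.3, 48.0],
    "technical": [51.8, 43.6, 57.9, 40.2],
    "general": [43.6, 51.0, 56.7, 39.5],
})

# Check that every column holds numeric data.
all_numeric = all(is_numeric_dtype(perindex_demo[col]) for col in perindex_demo.columns)
print(perindex_demo.dtypes)

# Drop the identifier so that it is not mistaken for a predictor.
model_data = perindex_demo.drop(columns=["empid"])
print(model_data.columns.tolist())
```

Dropping the identifier also keeps it out of any scatter plot matrix drawn from `model_data`.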

Graphical Representation of Data
It is always advisable to look at the data graphically through scatter plots, as these give insight into the bivariate relationships between variables. We import the example data with the read_csv function from the pandas library. To present the data graphically, we use the pairplot function from the seaborn library.
# Importing the data
import pandas as pd
perindex = pd.read_csv("Performance Index.csv")
# Graphical representation of the data
import seaborn as sns
sns.pairplot(perindex)
Scatter Plot Matrix
The pairplot function in the seaborn library helps visualize the bivariate relationships between variables. It also shows the distribution of each variable using a histogram. In the plot, the job performance index appears strongly related to the technical knowledge and general information scores.
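A scatter plot matrix shows relationships visually; a correlation matrix can put numbers on the same picture. The sketch below uses pandas' `DataFrame.corr` on synthetic data generated only for illustration (the coefficients 0.5 and 0.3 and the noise level are assumptions, not values from the case study). With the tutorial's data, the same call on `perindex` would apply.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; not the tutorial's dataset.
rng = np.random.default_rng(0)
n = 200
technical = rng.normal(50, 10, n)
general = rng.normal(50, 10, n)
jpi = 0.5 * technical + 0.3 * general + rng.normal(0, 3, n)
demo = pd.DataFrame({"jpi": jpi, "technical": technical, "general": general})

# Pearson correlations between every pair of columns.
corr = demo.corr()

# Correlation of each predictor with the dependent variable, strongest first.
print(corr["jpi"].drop("jpi").sort_values(ascending=False))
```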

Model for the Case Study
This is our MLR model for the example, where the left-hand side is the dependent variable, which in our case is the job performance index, and the right-hand side is the set of independent variables. b0 is the intercept or constant of the model, whereas b1 to b4 are the parameters for the respective independent variables. e is the error term in the model.

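Job Performance Index = b0 + b1 × aptitude + b2 × test of language + b3 × technical knowledge + b4 × general information + e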

Parameter Estimation using Least Square Method
The parameters are estimated using the least squares method. Here we have 5 parameter estimates: one for each independent variable and a constant term b0. Once these are estimated, the model equation is completely defined in terms of variables and estimated parameters. Let us see how to get these values in Python.
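To make the idea concrete, here is a minimal sketch of least squares estimation with NumPy's `np.linalg.lstsq`, which finds the parameter vector that minimizes the sum of squared residuals. The data are synthetic and the "true" parameter values are made up for illustration; this is not how the tutorial fits the model, only a look at what the estimation step does.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
n = 100
X = rng.normal(50, 10, size=(n, 4))            # four hypothetical test scores
true_b = np.array([5.0, 0.3, 0.2, 0.4, 0.1])   # made-up b0..b4

# Design matrix: a column of ones for the intercept, then the predictors.
A = np.column_stack([np.ones(n), X])
y = A @ true_b + rng.normal(0, 1.0, n)

# Least squares estimates of b0..b4.
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Sum of squared residuals at the estimate.
residuals = y - A @ b_hat
sse = float(residuals @ residuals)

# Any other parameter vector gives a larger sum of squared residuals.
sse_other = float(np.sum((y - A @ (b_hat + 0.01)) ** 2))
print(b_hat)
print(sse, sse_other)
```

At the least squares solution the residuals are orthogonal to every column of the design matrix, which is another way of stating the normal equations.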

Parameter Estimation Using ols() Function in Python
We import the formula interface of the statsmodels library (statsmodels.formula.api) under the alias smf. The function to fit the regression model is ols, which stands for Ordinary Least Squares. The ols function takes a formula with the dependent variable on the left and the independent variables on the right, separated by plus signs.
The data argument specifies our case study dataset and the fit method estimates all our regression parameters. The results are stored in the jpimodel object. The params attribute of the jpimodel object holds the parameter estimates.
The sign of each parameter shows the direction of its relationship with the dependent variable.
# Model fit
import statsmodels.formula.api as smf
jpimodel = smf.ols('jpi ~ tol + aptitude + technical + general', data=perindex).fit()
jpimodel.params
ols() fits a linear regression model by ordinary least squares.
~ separates the dependent variable from the independent variables.
The left-hand side of the tilde (~) is the dependent variable and the right-hand side lists the independent variables.
+ separates multiple independent variables.
#Output


Interpretation:
jpimodel.params gives the estimated model parameters.
The sign of each parameter shows the direction of its relationship with the dependent variable.
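The fitted parameters come back as a pandas Series indexed by term name, so individual estimates and their signs can be read off by label. The sketch below fits the same formula on synthetic data, since the case study file is not bundled here; the generating coefficients are made up, and the names `demo` and `demo_model` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only; coefficients below are made up.
rng = np.random.default_rng(2)
n = 60
demo = pd.DataFrame({
    "tol": rng.normal(50, 10, n),
    "aptitude": rng.normal(50, 10, n),
    "technical": rng.normal(50, 10, n),
    "general": rng.normal(50, 10, n),
})
demo["jpi"] = (-10 + 0.2 * demo["tol"] + 0.3 * demo["aptitude"]
               + 0.4 * demo["technical"] + 0.2 * demo["general"]
               + rng.normal(0, 1, n))

demo_model = smf.ols("jpi ~ tol + aptitude + technical + general", data=demo).fit()

# params is a pandas Series; the constant term is labeled "Intercept".
params = demo_model.params
print(params.index.tolist())
print(params["aptitude"])

# Sign of each slope: +1 for a positive relationship, -1 for a negative one.
signs = np.sign(params.drop("Intercept"))
print(signs)
```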

Interpretation of Partial Regression Coefficients
Let’s learn how to interpret these partial regression coefficients. In general, we say that for every unit increase in an independent variable (X), the expected value of the dependent variable changes by the corresponding parameter estimate (b), keeping all other variables constant. For example, the parameter estimate for the aptitude test is observed to be 0.32. Therefore, we infer that for a one unit increase in aptitude score, the expected value of the job performance index increases by 0.32 units, keeping the other test scores constant.
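The "one unit increase, everything else held constant" reading can be checked numerically: two predictions that differ only by one unit in a single predictor differ by exactly that predictor's coefficient. The sketch below uses synthetic data with two predictors and NumPy's least squares solver instead of statsmodels to stay short; the helper `predict` and all numbers are hypothetical.

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(3)
n = 80
aptitude = rng.normal(50, 10, n)
technical = rng.normal(50, 10, n)
jpi = 2.0 + 0.3 * aptitude + 0.5 * technical + rng.normal(0, 1, n)

# Fit jpi = b0 + b_apt * aptitude + b_tech * technical by least squares.
A = np.column_stack([np.ones(n), aptitude, technical])
b0, b_apt, b_tech = np.linalg.lstsq(A, jpi, rcond=None)[0]

def predict(apt, tech):
    """Expected job performance index for the given scores (hypothetical helper)."""
    return b0 + b_apt * apt + b_tech * tech

# Raise aptitude by one unit while holding the technical score fixed.
base = predict(50.0, 60.0)
plus_one = predict(51.0, 60.0)
change = plus_one - base

# The change in the prediction equals the aptitude coefficient.
print(change, b_apt)
```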

Quick Recap
To recap what we learned in this tutorial: we visualized bivariate relationships using a scatter plot matrix, fitted an MLR model in Python, and interpreted the coefficients of the model.