Multiple Linear Regression (MLR) is the backbone of predictive modelling and machine learning and an in-depth knowledge of MLR is critical in the predictive modeling world.
Previously we discussed implementing multiple linear regression in R, now we’ll look at implementing multiple linear regression using Python.
Multiple Linear Regression Case Study – Modeling Job Performance Index
Multiple Linear Regression Dataset Snapshot
Graphical Representation of Data
Previously we discussed implementing multiple linear regression in R, now we’ll look at implementing multiple linear regression using Python.
You can download the data files for this tutorial here.
In this tutorial the focus is on estimating model parameters to fit a model in Python and then interpreting the results. We will use the same case study that we used in the R tutorial earlier to explain the Python code. As statistical concepts were discussed in detail earlier and we will summarize the key points here.
Python
Python
Multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more.
Again, the price of a house in US dollars can be the dependent variable and the size of the house, its location, the air quality index in the area, distance from airport and so on, can be independent variables.
The price of a house is our target variable, which we call the DEPENDENT VARIABLE.
Statistical Model for MLR
Our statistical model has two parts –
The left hand side has the dependent variable denoted as Y, and the right hand side has independent variables denoted as X1, X2…up to Xp.
The left hand side has the dependent variable denoted as Y, and the right hand side has independent variables denoted as X1, X2…up to Xp.
Each independent variable has a specific WEIGHTAGE called a REGRESSION PARAMETER.
The parameter b0 is the intercept in the model.
The parameters of the model are estimated using the LEAST SQUARE METHOD.
Multiple Linear Regression Case Study – Modeling Job Performance Index
Let’s illustrate all of these concepts using a case study. The objective is to model the Job Performance Index based on the various TEST scores of newly recruited employees. The dependent variable is Job Performance Index and the independent variables are aptitude, test of language, technical knowledge and general information.
Multiple Linear Regression Dataset Snapshot
Here is a snapshot of the data with our dependent and independent variables.
All variables are numeric in nature. Employee ID is obviously not used as a variable in the model
Graphical Representation of Data
It is always advisable to have a graphical representation of the data through scatter plots as these will give insights into bivariate relationships between variables. We import the example data with the help of the read _csv function available in the pandas library. To present our data graphically, we use the seaborn library and the ‘pairplot’ function in seaborn.
#Importing the Data