Multiple Linear Regression (MLR) is the backbone of both predictive modelling and machine learning, and an in-depth knowledge of MLR is critical to understanding these key areas of data science. This tutorial provides an initial introduction to MLR using R.
You can download the data files for this tutorial here.
When is multiple linear regression used?
Multiple linear regression explains the relationship between one continuous dependent variable and two or more independent variables. The following example will illustrate this:
The price of a house in USD can be a dependent variable. The area of the house, its location, the air quality index in the area, and the distance from the airport, for example, can be independent variables. Independent variables can be continuous, such as the air quality index, or categorical, such as the location of the house. The price of the house is our target variable, which we call the dependent variable. To sum up, we have one dependent variable and a set of independent variables.
Statistical model for multiple linear regression
The statistical model for multiple linear regression has two parts: the left-hand side has the dependent variable, denoted Y, and the right-hand side has the independent variables, denoted X1, X2, …, Xp.
This means that there are, in general, p independent variables, each with a specific weight, which we call a regression parameter.
The parameter b0 is termed the regression intercept in the model.
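Putting these pieces together, the model can be written in its standard form, with a random error term added to capture the variation that the description above leaves implicit:

```latex
% Multiple linear regression model: Y is a linear combination of the
% p independent variables plus a random error term \varepsilon.
Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p + \varepsilon
```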
So how do we get the values of these unknown parameters using known values of Y and X variables?
To do this, we use the least squares method. This method minimises the error sum of squares in the data to fit the optimum model. Statistical software reports the least squares estimates as the main output of the regression model.
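Concretely, for a sample of n observations, least squares picks the parameter values that minimise the sum of squared differences between the observed values of Y and the values predicted by the model:

```latex
% Least squares criterion: choose b_0, ..., b_p to minimise the
% error sum of squares over the n observations.
\min_{b_0, \dots, b_p} \; \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip} \right)^2
```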
Multiple Linear Regression Case Study
Let’s illustrate these concepts using a case study. The objective is to model a job performance index based on the test scores of newly recruited employees. Our dependent variable is the job performance index, and our independent variables are scores on tests of aptitude, language, technical knowledge, and general information.
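As a preview of where the case study is heading, here is a minimal R sketch of the model fit. The data frame name (perf) and the column names (jpi, aptitude, tol, technical, general) are assumptions for illustration, not names taken from the actual data file:

```r
# Fit the multiple linear regression model: job performance index (jpi)
# regressed on the four test scores. Names are assumed for illustration.
model <- lm(jpi ~ aptitude + tol + technical + general, data = perf)

# summary() reports the least squares estimates of the regression
# parameters, along with standard errors, t-tests and R-squared.
summary(model)
```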
Here’s a snapshot of the data with our dependent and independent variables. All variables are numeric, and the employee ID is, of course, not used as a model variable.
It’s always advisable to have a graphical representation of the data, such as scatter plots, which give us insights into the variables’ bivariate relationships.
We import our example data using the read.csv function in R, then use the ggpairs function from the GGally library to present the data graphically, specifically to create scatterplots for our variables of interest.
Importing the Data
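Here is a minimal sketch of that workflow. The file name (job_performance.csv) and the position of the employee ID column are assumptions; adjust them to match the downloaded file:

```r
# GGally extends ggplot2 with ggpairs(), a scatterplot-matrix function.
library(GGally)

# Read the example data; the file name here is an assumption.
perf <- read.csv("job_performance.csv")

# Inspect the first few rows to confirm the import worked.
head(perf)

# Pairwise scatterplots (with correlations and density plots) for the
# model variables, dropping the employee ID, assumed to be column 1.
ggpairs(perf[, -1])
```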