Data Science Institute Tutorial Team

Multiple Linear Regression in R – a tutorial

Multiple linear regression (MLR) underpins much of predictive modelling and machine learning, and a solid understanding of it is essential for working in these areas of data science. This tutorial introduces MLR using R.

You can download the data files for this tutorial here.

When is multiple linear regression used?
Multiple linear regression explains the relationship between one continuous dependent variable and two or more independent variables. The following example will illustrate this:

Suppose the price of a house in USD is the dependent variable. The independent variables could include the area of the house, its location, the air quality index in the area, and its distance from the airport. Independent variables can be continuous, such as the air quality index, or categorical, such as the location of the house. The price of the house is the target we want to explain. To sum up, we have one dependent variable and a set of independent variables.

Statistical model for multiple linear regression
The statistical model for multiple linear regression has two parts: the left-hand side has the dependent variable, denoted Y, and the right-hand side has the independent variables, denoted X1, X2, ..., Xp.
In general, there are p independent variables, and each has its own weight, which we call a regression parameter (b1, b2, ..., bp).

The parameter b0 is termed the regression intercept in the model.

So how do we estimate these unknown parameters from the known values of the Y and X variables?
We use the least squares method. This method chooses the parameter values that minimise the error sum of squares, that is, the sum of squared differences between the observed and fitted values of Y. Statistical software reports these least squares estimates as the main output of the regression model.

Multiple linear regression statistical model: Y = b0 + b1X1 + b2X2 + ... + bpXp + e, where e is the error term.
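
The least squares step described above can be sketched in a few lines of base R. This is only an illustration on simulated data: the variable names x1, x2 and y and all the numbers are made up, not taken from the tutorial's dataset. It solves the normal equations by hand and compares the result with lm().

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n  <- 50
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- rnorm(n, mean = 30, sd = 5)
y  <- 5 + 0.4 * x1 + 0.8 * x2 + rnorm(n, sd = 2)

# Design matrix: a column of 1s for the intercept b0, then X1 and X2
X <- cbind(intercept = 1, x1 = x1, x2 = x2)

# Least squares estimates solve the normal equations (X'X) b = X'y
b_manual <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() computes the same estimates (internally via a QR decomposition)
b_lm <- coef(lm(y ~ x1 + x2))

# Error sum of squares at the least squares solution
sse <- sum((y - X %*% b_manual)^2)

print(round(b_manual, 4))
print(round(unname(b_lm), 4))
```

Moving any of the estimates away from the least squares solution makes the error sum of squares larger, which is what "minimises" means here.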

Multiple Linear Regression Case Study
Let’s illustrate these concepts using a case study. The objective is to model a job performance index based on the test scores of newly recruited employees. Our dependent variable is the job performance index, and our independent variables are the scores for aptitude, test of language, technical knowledge, and general information.


Here’s a snapshot of the data with our dependent and independent variables. All variables are numeric. The employee ID is only an identifier, so it is not used as a model variable.
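
A quick check of the kind described above might look like the following sketch. The data frame perindex_demo is a hypothetical stand-in with made-up values, and the column name empid for the employee ID is an assumption; the other column names follow the model formula used later in this tutorial.

```r
# Hypothetical stand-in for the tutorial data; all values are made up
perindex_demo <- data.frame(
  empid     = 1:6,
  jpi       = c(45.5, 40.6, 39.8, 50.1, 35.2, 48.9),
  aptitude  = c(43.8, 32.7, 42.8, 55.1, 30.0, 50.3),
  tol       = c(55.9, 51.3, 56.7, 60.2, 48.4, 58.8),
  technical = c(51.8, 51.5, 49.2, 58.3, 44.1, 55.6),
  general   = c(43.6, 51.0, 42.1, 52.4, 40.7, 49.9)
)

# Column types and a numeric summary of each variable
str(perindex_demo)
print(summary(perindex_demo))

# Confirm that every column is numeric and count missing values per column
all_numeric <- all(vapply(perindex_demo, is.numeric, logical(1)))
n_missing   <- colSums(is.na(perindex_demo))

# Variables that enter the model: everything except the identifier
model_vars <- setdiff(names(perindex_demo), "empid")
print(model_vars)
```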

It’s always advisable to have a graphical representation of the data, such as scatter plots, which give us insights into the variables’ bivariate relationships.

We import our example data using the read.csv function in R and use the GGally library and the ggpairs function to present our data graphically, specifically to create scatterplots for our variables of interest.

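Importing the Data
# Read the CSV file; the first row holds the column names
perindex<-read.csv("Performance Index.csv",header=TRUE)
Graphical Representation of the Data
library(GGally)  # provides ggpairs()
# Scatter plot matrix of the model variables only (employee ID excluded)
ggpairs(perindex[,c("jpi","aptitude","tol","technical","general")],
  title="Scatter Plot Matrix")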
The ggpairs function in the GGally package visualises the bivariate relationship between each pair of variables, quantifies it as a correlation coefficient, and shows the distribution of each variable. We observe that the job performance index is highly correlated with the technical knowledge and general information scores.

Scatterplot with Bivariate relationships- ggplot, GGally
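
The correlation coefficients that ggpairs displays can also be computed directly with base R's cor() function. The sketch below uses simulated data with the same column names as the case study; the values and the strength of each relationship are assumptions chosen for illustration, not results from the tutorial's dataset.

```r
# Simulated scores (hypothetical values, for illustration only)
set.seed(42)
n <- 40
aptitude  <- rnorm(n, mean = 50, sd = 8)
tol       <- rnorm(n, mean = 50, sd = 8)
technical <- rnorm(n, mean = 50, sd = 8)
general   <- rnorm(n, mean = 50, sd = 8)
jpi <- 0.3 * aptitude + 0.05 * tol + 0.5 * technical + 0.4 * general +
  rnorm(n, sd = 3)
demo <- data.frame(jpi, aptitude, tol, technical, general)

# Pairwise Pearson correlations between all variables
cor_matrix <- cor(demo)
print(round(cor_matrix, 2))

# Correlation of each test score with jpi, largest first
jpi_cor <- sort(cor_matrix["jpi", -1], decreasing = TRUE)
print(round(jpi_cor, 2))
```

For a plot without extra packages, pairs(demo) draws a comparable scatter plot matrix in base R.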

Multiple linear regression usually describes a response better than simple linear regression, because a single predictor rarely carries enough information about the response variable. In most practical problems the response is influenced by more than one variable, so the predictors need to be studied together, as in the example just explained.

Multiple linear regression can answer many questions such as:

  • Do tests conducted at recruitment time determine a candidate’s performance in the initial six months of the job?
  • Which of the four test scores is the most significant in determining job performance?
  • Can any test be discontinued?
  • Can the performance of newly recruited candidates be estimated based on test scores at the time of recruitment?

This is the MLR model for our case study. The left-hand side is the dependent variable, the job performance index, and the right-hand side is the set of independent variables. b0 is the intercept or constant of the model, and b1 to b4 are the regression parameters for the respective independent variables. Finally, e is the error term in the model.

MLR model for the job performance index: jpi = b0 + b1(aptitude) + b2(tol) + b3(technical) + b4(general) + e

The parameters are estimated using the least squares method, as discussed previously. There are five parameter estimates: the constant term b0 and one for each of the four independent variables. We now have a model equation wholly defined in terms of variables and estimated parameters.

Parameter estimates

Now, let us fit the model using the lm function in R; lm stands for linear model. We store the fitted model in an object, jpimodel, from which we can show the coefficient estimates. The lm function takes a formula with the dependent variable on the left of the tilde and the independent variables, separated by plus signs, on the right.
Model Fit
jpimodel<-lm(jpi~aptitude+tol+technical+general, data=perindex)
jpimodel  # print the call and the coefficient estimates
# lm() fits a linear regression model.
# ~ separates the dependent variable (left-hand side) from the independent variables (right-hand side).
# + separates multiple independent variables.

Model Output
The table shows the output of the MLR model as displayed in R. The coefficients are the model parameter estimates, and the sign of each estimate gives the direction of that variable’s relationship with the dependent variable.

MLR Model Output
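
To see what this kind of output looks like without the tutorial's data file, the following sketch fits the same formula to simulated data. The data frame demo, the object demo_model and all the numbers are hypothetical, so the estimates printed here will not match the case study.

```r
# Simulated data with the case study's column names (values are made up)
set.seed(7)
n <- 40
demo <- data.frame(
  aptitude  = rnorm(n, mean = 50, sd = 8),
  tol       = rnorm(n, mean = 50, sd = 8),
  technical = rnorm(n, mean = 50, sd = 8),
  general   = rnorm(n, mean = 50, sd = 8)
)
demo$jpi <- -10 + 0.3 * demo$aptitude + 0.05 * demo$tol +
  0.5 * demo$technical + 0.4 * demo$general + rnorm(n, sd = 3)

# Fit the model and print the call together with the coefficient estimates
demo_model <- lm(jpi ~ aptitude + tol + technical + general, data = demo)
print(demo_model)

# Extract the estimates as a named vector: intercept first, then one per predictor
estimates <- coef(demo_model)

# Sign of each slope: +1 for a positive relationship, -1 for a negative one
print(sign(estimates[-1]))

# Fuller table with a standard error for each estimate
coef_table <- summary(demo_model)$coefficients
print(round(coef_table, 3))
```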

Let’s see how to interpret these partial regression coefficients. In general, for every one-unit increase in an independent variable (X), the expected value of the dependent variable changes by the corresponding parameter estimate (b), keeping all other variables constant. For example, the parameter estimate for the aptitude test is 0.32. We therefore infer that for a one-unit increase in aptitude score, the expected value of the job performance index increases by 0.32 units, holding the other three test scores constant.

Partial regression coefficients
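
The "keeping all other variables constant" interpretation can be checked numerically with predict(). In this sketch the data are simulated and the two candidates, base_case and plus_one, are hypothetical; only the logic carries over to the real model.

```r
# Simulated data with the case study's column names (values are made up)
set.seed(7)
n <- 40
demo <- data.frame(
  aptitude  = rnorm(n, mean = 50, sd = 8),
  tol       = rnorm(n, mean = 50, sd = 8),
  technical = rnorm(n, mean = 50, sd = 8),
  general   = rnorm(n, mean = 50, sd = 8)
)
demo$jpi <- -10 + 0.3 * demo$aptitude + 0.05 * demo$tol +
  0.5 * demo$technical + 0.4 * demo$general + rnorm(n, sd = 3)
demo_model <- lm(jpi ~ aptitude + tol + technical + general, data = demo)

# Two hypothetical candidates who differ only by one point of aptitude
base_case <- data.frame(aptitude = 50, tol = 50, technical = 50, general = 50)
plus_one  <- transform(base_case, aptitude = aptitude + 1)

# Difference in predicted jpi between the two candidates
change <- unname(predict(demo_model, plus_one) - predict(demo_model, base_case))

# The difference equals the aptitude coefficient
print(change)
print(unname(coef(demo_model)["aptitude"]))
```

The same comparison works for any of the other test scores: raise one score by a unit, leave the rest unchanged, and the prediction shifts by that score's coefficient.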

Here’s a recap of the main concepts covered in this tutorial. First, we learned how to understand our data and ensure consistency in the dataset. We then covered how to represent our data graphically by using the ggpairs function. Lastly, we learned how to fit a multiple linear regression model in R and interpret its coefficients.
