You can download the data files for this tutorial here.

We’ll begin with an overview of various classification methods and then introduce the Naïve Bayes Classifier. The method is based on Bayes’ Theorem, which requires an understanding of the concept of conditional probability. We’ll discuss how to form the classification rule using the Naïve Bayes method and then implement the method in R. Finally, we’ll discuss both the advantages and limitations of the method.

__Classification Methods__

Apart from Naïve Bayes, there are several other machine learning algorithms used for classification problems. These include Support Vector Machine, K Nearest Neighbour (KNN), Decision Tree, Random Forest and Neural Networks. In this tutorial we’ll focus on the Naïve Bayes method, and look at the other methods in subsequent tutorials.

About the Naive Bayes Classifier

The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes’ Theorem. It can be used as an alternative to binary logistic regression or multinomial logistic regression. It’s important to note that the Naïve Bayes classifier assumes the predictors are conditionally independent given the class, which is a strong assumption, and that it is particularly suitable when the dimensionality of the inputs is high. Despite its simplicity, Naïve Bayes may in some situations outperform more sophisticated classification methods, but this is not always the case.

__Conditional Probability__

Before we look at Bayes’ theorem, it’s important to understand the concept of conditional probability. Let’s consider a simple textbook example: the tossing of an unbiased die (which has the numbers 1, 2, 3, 4, 5, 6). The sample space therefore has 6 points.

We can define an event A such that we get a number greater than 1 on the uppermost face of the die. We can define another event B such that we get an even number (i.e. either 2 or 4 or 6) on the uppermost face of the die.

When all outcomes are equally likely, the probability of an event is the ratio of the number of favourable outcomes to the total number of outcomes. Therefore, the probability of event A occurring is 5/6. Similarly, the probability of event B occurring is 3/6.

We are now interested in knowing “what is the probability of the occurrence of event B given the condition that A has already occurred?” Since we already know that A has occurred, a number greater than 1 must have appeared on the uppermost face of the die. Hence, to get the conditional probability of B given that A has occurred, **the sample space has only 5 points** (i.e. the uppermost face must have been either 2, 3, 4, 5 or 6). The number of favourable cases for the occurrence of B is still 3 (that is 2, 4, 6). Hence, by the definition of probability, the probability of B given that A has occurred is 3/5.
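The die example can be checked by simple enumeration in R. This is only a minimal sketch: the variable names (`outcomes`, `A`, `B` and so on) are chosen here for illustration and are not part of the tutorial’s data files.

```r
# Sample space for one roll of a fair six-sided die
outcomes <- 1:6

# Event A: number greater than 1; Event B: even number
A <- outcomes > 1
B <- outcomes %% 2 == 0

# Unconditional probabilities: favourable outcomes / total outcomes
p_A <- mean(A)   # 5/6
p_B <- mean(B)   # 3/6

# Conditional probability of B given A: restrict the sample space to A
p_B_given_A <- sum(A & B) / sum(A)   # 3/5

# Same quantity from the definition P(B | A) = P(A and B) / P(A)
p_B_given_A_def <- mean(A & B) / p_A
```

Both routes give 3/5: counting inside the reduced sample space, and dividing the joint probability by P(A).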

Bayes Theorem

The principle behind Naïve Bayes is Bayes’ theorem, also known as Bayes’ rule. Bayes’ theorem is used to calculate conditional probability, which is simply the probability of an event occurring given information about events that have already occurred. Mathematically, for two events A and B with P(B) > 0, Bayes’ theorem is represented as shown in this equation:

P(A | B) = P(B | A) × P(A) / P(B)

**This is the statement of Bayes’ Theorem.**
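As a quick numerical check, Bayes’ theorem in its usual form, P(A | B) = P(B | A) × P(A) / P(B), can be applied to the die example from the Conditional Probability section (A: number greater than 1, B: even number). The sketch below is illustrative only and the variable names are chosen here.

```r
# Fair six-sided die: A = number greater than 1, B = even number
outcomes <- 1:6
A <- outcomes > 1
B <- outcomes %% 2 == 0

p_A <- mean(A)                       # 5/6
p_B <- mean(B)                       # 3/6
p_B_given_A <- sum(A & B) / sum(A)   # 3/5

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_A_given_B_bayes <- p_B_given_A * p_A / p_B

# Direct count for comparison: outcomes in both A and B, out of those in B
p_A_given_B_direct <- sum(A & B) / sum(B)
```

Both values equal 1, which makes sense: every even number on the die is greater than 1, so once B has occurred, A is certain.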

Naive Bayes Framework

To get a better understanding of how Naive Bayes works in classification problems, let’s look at the following situation:

Here Y is the target variable, which must be categorical. It is important to note that this method is not applicable if Y is a continuous variable. However, X variables or predictors can be either categorical or continuous variables. The objective is to estimate the probability of Y taking a specific value given the values of X variables. Since these are conditional probabilities, Bayes theorem will be used to estimate them.
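In symbols, the quantity being estimated can be written as follows. This is the standard Naïve Bayes form rather than a formula taken from this tutorial; the notation is introduced here, with K as the number of categories of Y, k and l as category indices, and p as the number of predictors. The product in the numerator is where the conditional independence assumption enters.

```latex
P(Y = k \mid X_1, \dots, X_p)
  = \frac{P(Y = k)\,\prod_{j=1}^{p} P(X_j \mid Y = k)}
         {\sum_{l=1}^{K} P(Y = l)\,\prod_{j=1}^{p} P(X_j \mid Y = l)}
```

The denominator is the same for every category k, so it only rescales the numerators so that the K probabilities sum to 1.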

Naive Bayes Framework – Example

Now let’s look at a hypothetical example to further understand the framework of the Naïve Bayes method. Here Y is a binary variable that takes the value 1 if the person is a “potential buyer” (of a certain product) and the value 0 if the person is not a “potential buyer”. We consider two other variables, X1 as “Age” and X2 as “Gender”, both coded as binary variables.

Classification Rule

Here we find two conditional probabilities: the probability that Y equals 0 given the values of X1 and X2, and the probability that Y equals 1 given the values of X1 and X2. Based on these two probabilities, we classify Y as either 0 or 1. Y can also have more than two categories; in that case, we classify Y into the category for which the conditional probability is highest.
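The rule can be sketched by hand in R for the buyer example. All the probabilities below are made-up numbers used purely for illustration, and the object and function names (`prior`, `p_x1`, `p_x2`, `posterior`) are hypothetical; nothing here is estimated from real data.

```r
# Hypothetical (made-up) probabilities, for illustration only
prior <- c("0" = 0.6, "1" = 0.4)   # P(Y = y)

# P(X1 = 1 | Y = y) and P(X2 = 1 | Y = y), one value per class of Y
p_x1 <- c("0" = 0.3, "1" = 0.7)
p_x2 <- c("0" = 0.5, "1" = 0.2)

# Posterior probabilities of Y for one case with binary x1 and x2,
# multiplying the per-predictor terms (conditional independence)
posterior <- function(x1, x2) {
  lik_x1 <- if (x1 == 1) p_x1 else 1 - p_x1
  lik_x2 <- if (x2 == 1) p_x2 else 1 - p_x2
  unnorm <- prior * lik_x1 * lik_x2
  unnorm / sum(unnorm)   # normalise so the probabilities sum to 1
}

post <- posterior(x1 = 1, x2 = 0)

# Classification rule: pick the class with the largest posterior probability
predicted_class <- names(post)[which.max(post)]
```

With these made-up inputs, the unnormalised values are 0.09 for Y = 0 and 0.224 for Y = 1, so the case is assigned to Y = 1.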

Expected Output

This is the expected output when we apply the Naïve Bayes method using any software. We get the estimated probabilities for “Y is equal to 1” as well as “Y is equal to 0”. In general, if we have K categories, we shall get K estimated probabilities. Based on these predicted probabilities, we can classify Y into a specific category.
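To make the K-category case concrete, here is a small sketch of how a table of predicted probabilities could be turned into class labels. The matrix `probs` and the category labels "A", "B", "C" are invented for illustration and do not come from any fitted model.

```r
# Hypothetical predicted probabilities for 4 cases and K = 3 categories
probs <- matrix(
  c(0.70, 0.20, 0.10,
    0.10, 0.30, 0.60,
    0.25, 0.50, 0.25,
    0.05, 0.15, 0.80),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C"))
)

# The K probabilities for each case should sum to 1
row_totals <- rowSums(probs)

# Assign each case to the category with the largest probability
predicted <- colnames(probs)[max.col(probs, ties.method = "first")]
```

Here `max.col()` returns, for each row, the column index of the largest value, which is then mapped back to the category name.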

Advantages of the Naive Bayes Method

Let us look at the advantages of the Naïve Bayes method. Firstly, the classification rule is simple to understand. Secondly, the method requires only a small amount of training data to estimate the parameters necessary for classification. Thirdly, the evaluation of the classifier is quick and easy. Finally, the method can be a good alternative to logistic regression.

Limitations of Naive Bayes Method

There are also a few limitations to the Naïve Bayes method.

The assumption of conditional independence of the independent variables is often unrealistic in practice. In the case of continuous independent variables, the density function must be known or assumed to be normal. In the case of categorical independent variables, the estimated probability becomes zero if the count in any conditional category is zero, which reduces the whole probability expression to zero. For instance, if there are no respondents in the age group 25-30 years among those with Y=1, then P(X1=0 | Y=1) = 0.

However a remedy does exist for this limitation. If a category has zero entries, we replace 0 by 0.5/n (n = sample size) so that the probability expression does not reduce to zero.
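The remedy can be sketched as follows. The counts and the sample size are hypothetical, and the sketch assumes one reading of the rule, namely that it is the zero probability estimate (not the raw count) that is replaced by 0.5/n.

```r
# Hypothetical counts of the X1 categories among cases with Y = 1
counts <- c("X1=0" = 0, "X1=1" = 12, "X1=2" = 28)
n <- 100   # hypothetical total sample size

# Raw conditional probability estimates P(X1 = x | Y = 1)
cond_prob <- counts / sum(counts)

# Remedy (as interpreted here): replace a zero estimate by 0.5 / n
adjusted <- ifelse(cond_prob == 0, 0.5 / n, cond_prob)
```

After the adjustment no estimate is exactly zero, so the product of conditional probabilities can no longer collapse to zero. Note that the adjusted values no longer sum exactly to 1.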

Case Study – Modeling Loan Defaults

Let’s now implement the Naïve Bayes method with bank loan data. We are going to apply the Naïve Bayes method to the Bank Loan Data and then compare its performance with Binary Logistic Regression. For this data, we have assumed that a bank possesses demographic and transactional data of its loan customers. If the bank has a model to predict defaulters, it can help in loan disbursal decision making. The objective here is “to predict whether the customer applying for the loan will be a defaulter or not”. The sample size for this data is 700. We have Age group, Years at current address, Years at current employer, Debt to Income Ratio, Credit Card Debts and Other Debts as the **Independent Variables**, and Defaulter (=1 if defaulter, 0 otherwise) is the **Dependent Variable**. The information on predictors was collected at the time of the loan application process. The status is observed after a loan is disbursed.

Bank Loan Data

This is a snapshot of the data. In this data, Age is a categorical variable with three categories, although it is coded as integers. The other independent variables are continuous, whereas the dependent variable “Defaulter” is a binary variable.

Logistic Regression in R

Before we implement the Naïve Bayes method, we first apply Binary Logistic Regression (BLR) to the “Bank Loan Data” to understand the performance of BLR. We import the data file using the familiar **read.csv function** and look at the structure of the data. We notice that “Age” is an integer variable, which needs to be converted to a factor so that we can treat the variable appropriately. We use the **glm function** in R, which stands for Generalized Linear Model. We specify six independent variables and the dependent variable to be “Defaulter”. The argument “family=binomial” indicates that it is Binary Logistic Regression. The analysis output is stored in the object “riskmodel”.

```r
# Importing data and checking data structure
bankloan <- read.csv("BANK LOAN.csv", header = TRUE)

str(bankloan)
```
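The factor conversion and model-fitting steps described above can be sketched on a small synthetic data set. Everything in this block is an assumption made for illustration: the data frame `toy`, its column names (`AGE`, `DEBTINC`, `DEFAULTER`) and the simulated values are invented stand-ins, not the actual bank loan file, and only two predictors are used instead of six.

```r
# Synthetic stand-in data (NOT the bank loan file); column names are hypothetical
set.seed(1)
n <- 200
toy <- data.frame(
  AGE     = sample(1:3, n, replace = TRUE),   # age group coded as integers
  DEBTINC = runif(n, min = 1, max = 30)       # debt to income ratio
)

# Simulated binary outcome whose probability rises with DEBTINC
toy$DEFAULTER <- rbinom(n, size = 1, prob = plogis(-3 + 0.15 * toy$DEBTINC))

# Convert the integer-coded age group to a factor
toy$AGE <- factor(toy$AGE)

# Binary logistic regression: family = binomial
riskmodel <- glm(DEFAULTER ~ AGE + DEBTINC, data = toy, family = binomial)

summary(riskmodel)
```

Because `AGE` is a factor, `glm` estimates a separate coefficient for each age group relative to the first one, instead of treating the integer codes as a numeric scale.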