Boosting in R

Are you a big fan of ensemble models? Well, here is boosting in R, yet another ensemble-based method. In this article, we will explore how boosting works in R and how it can make our model predictions better. Let’s roll!

Boosting is an ensemble-based method used to boost the performance of weak learners. Similar to bagging, boosting algorithms build an ensemble of models that are trained on resamples of the data, and the models then vote to decide the final prediction.

Before moving forward, you should understand two ways in which boosting differs from bagging.

The resampled datasets in boosting are constructed specifically to generate complementary learners.
Boosting does not give every model an equal vote as bagging does. Each model's vote is weighted by its individual performance, so the better a model performs, the greater its influence on the final prediction.
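To make the second point concrete, here is a minimal sketch in base R of weighted versus equal voting. The three learners, their predictions and their weights (`alpha`) are all made up for illustration, and `weighted_vote` is a hypothetical helper, not a function from any package.

```r
# Hypothetical class predictions from three weak learners (rows)
# for four observations (columns). All values are invented.
preds <- rbind(
  m1 = c("1", "2", "1", "2"),
  m2 = c("1", "1", "1", "2"),
  m3 = c("2", "1", "2", "2")
)

# Invented performance-based weights: m3 is assumed to be the strongest learner
alpha <- c(m1 = 0.4, m2 = 0.5, m3 = 1.2)

# Hypothetical helper: for each observation, add up the weights of the
# learners voting for each class and return the class with the largest total
weighted_vote <- function(preds, alpha) {
  classes <- sort(unique(as.vector(preds)))
  score <- sapply(seq_len(ncol(preds)), function(i) {
    sapply(classes, function(k) sum(alpha[preds[, i] == k]))
  })
  classes[apply(score, 2, which.max)]
}

boosted <- weighted_vote(preds, alpha)      # performance-weighted votes
bagged  <- weighted_vote(preds, rep(1, 3))  # equal votes, as in bagging

print(boosted)
print(bagged)
```

For the first and third observations the single strong learner outvotes the two weaker ones under weighted voting, while equal voting follows the majority.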

AdaBoost – Adaptive Boosting in R

The concept of AdaBoost was proposed by Freund and Schapire in 1997. AdaBoost, or adaptive boosting, trains a sequence of weak learners and, in each round, gives more weight to the examples that the earlier learners found difficult to classify.
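The reweighting idea can be sketched for a single round in base R. This is a toy illustration, not the code of any package: the five observations and the single misclassification are invented, and the learner weight uses the classic AdaBoost.M1 form alpha = ln((1 - err) / err); implementations may scale alpha differently.

```r
# Toy data: five observations, all starting with equal weight
w <- rep(1 / 5, 5)

# Assume the current weak learner misclassifies only observation 3
miss <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

# Weighted error of this learner and its resulting vote weight
err   <- sum(w[miss])
alpha <- log((1 - err) / err)

# Increase the weight of the misclassified observation, then renormalise
# so that the weights sum to 1 again
w_new <- w * exp(alpha * miss)
w_new <- w_new / sum(w_new)

print(round(w_new, 3))
```

After the update, the misclassified observation carries half of the total weight, so the next learner is pushed to get it right.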

The adabag package:

You can use the adabag package to implement the AdaBoost.M1 classifier. Once the classifier is trained, you can use it to make predictions on unseen data. You can measure the error rate using a separate dataset, or you can use cross-validation.

Well, I hope you now have a good understanding of boosting, AdaBoost and adabag. Now, let’s see all of them in action.

Credit dataset

We are using credit data for this purpose. Let’s explore the data using functions such as str() and summary() to get some insights into the data.



#Read the dataset
df <- read.csv('credit.csv')

#Explore the datatypes
str(df)



'data.frame':	1000 obs. of 20 variables:
 $ months_loan_duration: chr "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ credit_history      : int 6 48 12 42 24 36 24 36 12 30 ...
 $ purpose             : chr "critical" "repaid" "critical" "repaid" ...
 $ amount              : chr "radio/tv" "radio/tv" "education" "furniture" ...
 $ savings_balance     : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ employment_length   : chr "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ installment_rate    : chr "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
 $ personal_status     : int 4 2 2 2 3 2 3 2 2 4 ...
 $ other_debtors       : chr "single male" "female" "single male" "single male" ...
 $ residence_history   : chr "none" "none" "none" "guarantor" ...
 $ property            : int 4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : chr "real estate" "real estate" "real estate" "building society savings" ...
 $ installment_plan    : int 67 22 49 45 53 35 53 35 61 28 ...
 $ housing             : chr "none" "none" "none" "none" ...
 $ existing_credits    : chr "own" "own" "own" "for free" ...
 $ default             : int 2 1 1 1 2 1 1 1 1 2 ...
 $ dependents          : int 1 2 1 1 2 1 1 1 1 2 ...
 $ telephone           : int 1 1 2 2 2 2 1 1 1 1 ...
 $ foreign_worker      : chr "yes" "none" "none" "none" ...
 $ job                 : chr "yes" "yes" "yes" "yes" ...

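Take some time to analyze and understand what these numbers are telling us. Describing the data is a key aspect of any analysis work, and you should spend some time here. Notice, for example, that the values in this output look shifted by one column relative to the names: months_loan_duration holds account-balance categories and age holds property types. This suggests the header of the CSV does not line up with the data, so check that the column called default really holds the default indicator before you model it.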

AdaBoost classifier using boosting in R

So, we have the data. I hope you spent some time understanding what it is about. Now, we can move forward and create the train and test data to perform boosting.

You can start by loading the required libraries. If they are not installed yet, install them first with install.packages().



#Load required libraries
library(caret)
library(adabag)

#Create the train and test split [90:10]
#credit_data holds the row indices selected for training
credit_data <- createDataPartition(df$default, p = 0.90, list = FALSE)
train_data <- df[credit_data, ]
test_data <- df[-credit_data, ]

We have created the train and test data with a 90:10 ratio: 90% of the rows go to training and 10% to testing. You can see a glimpse of the train and test data below.



#Explore train data

str(train_data)



'data.frame':	901 obs. of 20 variables:
 $ months_loan_duration: chr "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ credit_history      : int 6 48 12 42 24 36 24 36 12 30 ...
 $ purpose             : chr "critical" "repaid" "critical" "repaid" ...
 $ amount              : chr "radio/tv" "radio/tv" "education" "furniture" ...
 $ savings_balance     : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ employment_length   : chr "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ installment_rate    : chr "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
 $ personal_status     : int 4 2 2 2 3 2 3 2 2 4 ...
 $ other_debtors       : chr "single male" "female" "single male" "single male" ...
 $ residence_history   : chr "none" "none" "none" "guarantor" ...
 $ property            : int 4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : chr "real estate" "real estate" "real estate" "building society savings" ...
 $ installment_plan    : int 67 22 49 45 53 35 53 35 61 28 ...
 $ housing             : chr "none" "none" "none" "none" ...
 $ existing_credits    : chr "own" "own" "own" "for free" ...
 $ default             : int 2 1 1 1 2 1 1 1 1 2 ...
 $ dependents          : int 1 2 1 1 2 1 1 1 1 2 ...
 $ telephone           : int 1 1 2 2 2 2 1 1 1 1 ...
 $ foreign_worker      : chr "yes" "none" "none" "none" ...
 $ job                 : chr "yes" "yes" "yes" "yes" ...



#Explore test data

str(test_data)



'data.frame':	99 obs. of 20 variables:
 $ months_loan_duration: chr "unknown" "unknown" "unknown" "1 - 200 DM" ...
 $ credit_history      : int 9 10 6 24 27 12 36 12 18 24 ...
 $ purpose             : chr "critical" "critical" "fully repaid" "delayed" ...
 $ amount              : chr "car (new)" "furniture" "radio/tv" "furniture" ...
 $ savings_balance     : int 2134 2069 426 2333 5965 6468 1953 1007 1568 3617 ...
 $ employment_length   : chr "< 100 DM" "unknown" "< 100 DM" "unknown" ...
 $ installment_rate    : chr "1 - 4 yrs" "1 - 4 yrs" "> 7 yrs" "0 - 1 yrs" ...
 $ personal_status     : int 4 2 4 4 1 2 4 4 3 4 ...
 $ other_debtors       : chr "single male" "married male" "married male" "single male" ...
 $ residence_history   : chr "none" "none" "none" "none" ...
 $ property            : int 4 1 4 2 2 1 4 1 4 4 ...
 $ age                 : chr "other" "other" "other" "building society savings" ...
 $ installment_plan    : int 48 26 39 29 30 52 61 22 24 20 ...
 $ housing             : chr "none" "none" "none" "bank" ...
 $ existing_credits    : chr "own" "own" "own" "own" ...
 $ default             : int 3 2 1 1 2 1 1 1 1 2 ...
 $ dependents          : int 1 1 1 1 1 2 2 1 1 1 ...
 $ telephone           : int 1 1 1 1 1 1 1 1 1 1 ...
 $ foreign_worker      : chr "yes" "none" "none" "none" ...
 $ job                 : chr "yes" "no" "yes" "yes" ...

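You have to convert the target variable (default) to a factor in both the train and test data. boosting() builds a classifier and expects a factor as the response, so leaving default as an integer causes an error.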



#Convert the target variable to a factor
train_data$default <- as.factor(train_data$default)
test_data$default <- as.factor(test_data$default)
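If you are unsure what the conversion does, here is a tiny sketch on a stand-alone vector. The values are the first five entries of default from the str() output above; the variable names `default_int` and `default_fac` are just for illustration.

```r
# First five values of the default column, as shown by str(df)
default_int <- c(2L, 1L, 1L, 1L, 2L)

# as.factor() turns each distinct value into a class label (a level)
default_fac <- as.factor(default_int)

print(class(default_int))   # integer: would be treated as a number
print(class(default_fac))   # factor: treated as a class label
print(levels(default_fac))
```

The numbers themselves do not change; R simply stops treating them as quantities and starts treating them as categories.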



#Train the model with 10 boosting iterations
my_model <- boosting(default~., data = train_data, boos = TRUE, mfinal = 10)

#Model in action: predict on the test data
predict_model <- predict(my_model, test_data)

#Confusion matrix of the predictions
predict_model$confusion

#Error rate of the predictions
predict_model$error



               Observed Class
Predicted Class  1  2  3
              1 54  7  1
              2 10 25  2

Error - 0.2020202
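As a quick check, the reported error can be recomputed by hand from the confusion matrix above. The sketch below re-enters the six counts manually in base R (the object names `conf`, `shared`, `correct` and `error` are illustrative); it does not use the model object.

```r
# Confusion matrix from the output above: rows = predicted, columns = observed
conf <- matrix(
  c(54,  7, 1,
    10, 25, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(Predicted = c("1", "2"), Observed = c("1", "2", "3"))
)

# Correct predictions are the cells where predicted and observed class agree.
# Class 3 is never predicted, so only classes 1 and 2 can contribute.
shared  <- intersect(rownames(conf), colnames(conf))
correct <- sum(conf[cbind(shared, shared)])

# Error rate = share of test observations that were misclassified
error <- 1 - correct / sum(conf)
print(error)
```

Here 79 of the 99 test observations are classified correctly, so 20 of 99 are wrong, which matches the error printed by predict_model$error.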

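Fantastic. You have built an AdaBoost classifier to predict the loan defaulters in the input dataset, and it misclassifies about 20% of the test data. That’s how boosting works in R. Feel free to explore the other components of predict_model, such as predict_model$class and predict_model$prob.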

AdaBoost classifier with boosting.cv in R

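boosting.cv is another method, which runs v-fold cross-validation: the data is split into v subsets, and each subset in turn is predicted by a model trained on the remaining subsets. Here we use v = 5. Let’s see how it works in R.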
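The splitting idea behind v-fold cross-validation can be sketched in base R. This only illustrates how rows can be assigned to folds; it is not how boosting.cv is implemented internally, and the seed and object names are arbitrary choices.

```r
set.seed(1)   # arbitrary seed so the fold assignment is repeatable

n <- 1000     # number of rows, as in the credit data
v <- 5        # number of folds

# Give every row a fold label from 1 to v, in random order
fold_id <- sample(rep(seq_len(v), length.out = n))
fold_sizes <- table(fold_id)
print(fold_sizes)

# In round 1, fold 1 is held out for testing and the rest is used for training
test_rows  <- which(fold_id == 1)
train_rows <- which(fold_id != 1)
print(c(train = length(train_rows), test = length(test_rows)))
```

Repeating the last step for folds 2 to 5 means every row is predicted exactly once by a model that never saw it during training.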



#Convert the target variable to a factor
df$default <- as.factor(df$default)

#Create the boosting.cv classifier with 5-fold cross-validation
model_cv <- boosting.cv(default~., data = df, boos = TRUE, mfinal = 10, v = 5)

#Confusion matrix of the cross-validated predictions
model_cv$confusion

#Cross-validated error rate
model_cv$error

               Observed Class
Predicted Class   1   2   3   4
              1 543 102   9   3
              2  88 229  18   3
              3   2   2   1   0

Error - 0.227

That’s it. You have built 2 classifier models using the boosting and boosting.cv methods. The AdaBoost classifier misclassifies about 20% of the hold-out test set and about 23% of the observations under 5-fold cross-validation. You can try these methods on other datasets as well.

Ending note

Boosting is an ensemble-based method that boosts the performance of weak learners by combining them with performance-weighted votes. I hope that after reading this, you can use boosting methods in R to improve your model performance. That’s all for now. Happy R!

Further reading: R documentation