Research question and the data science problem

  • The research question is anchored in the main objective of this project, which is to create a machine learning model that can predict which individuals are most likely to have or use a bank account.
  • The data science problem will be a classification problem.
  • My personal objective in this project is to learn how to handle an imbalanced dataset in a classification problem.

Data Health and Preparation

The training data contains 23,524 observations and 13 variables. Of these, 10 are nominal (including the response column) and 3 are numeric. The dependent (response) variable is bank_account, whose observations are recorded as yes or no. Part of the data wrangling will entail encoding the response variable as 1 and 0 for yes and no respectively. A quick glimpse at the response column indicates a strong imbalance, with 86 percent of all respondents not having a bank account.
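A minimal sketch of this step is shown below. The column name bank_account and the "Yes"/"No" spelling are assumptions about the raw data; the modelling code later in the report uses a factor named BankAccount.

# Class balance of the response -- roughly 86% "No", as noted above
round(prop.table(table(training$bank_account)) * 100, 1)

# Encode the response as 1/0 for yes/no, as described in the text
training$bank_account_num <- ifelse(training$bank_account == "Yes", 1, 0)

# Confirm there are no missing values in any column
colSums(is.na(training))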

No missing values are present in the dataset.

Half of all respondents are 35 years of age or younger, while 75% of all respondents are 49 years or younger. Household-wise, the largest household consisted of 21 members, against an average of about 4. On both variables, measures of spread indicated the presence of outliers. Using Tukey’s rule of outlier detection, 2.77% of all observations were flagged as outliers. Given their marginal proportion, these observations were dropped.
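As a sketch of Tukey’s rule as applied here (the column names age and household_size are assumptions, not taken from the report):

# Tukey's rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
tukey_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  x < q[1] - fence | x > q[2] + fence
}

is_outlier <- tukey_outlier(training$age) | tukey_outlier(training$household_size)
mean(is_outlier)                      # share of flagged rows (~2.77% reported above)
training <- training[!is_outlier, ]   # drop the outlying observations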

Data Analysis

Descriptive

On both variables, the distribution is light-tailed (platykurtic). Outliers, which were marginal relative to the total sample, were removed during the data wrangling process. Age has a high standard deviation, indicating that its values are spread out around the mean. The standard deviation (σ), as a measure of dispersion, is an indicator of how accurately the mean (μ) represents the sample data.

Variable         Mean   SD     SEM   IQR  Skewness  Kurtosis  p25  p50  p75  p100
Age              38.21  15.63  0.10  22   0.72      -0.28     26   35   48   81
Household Size    3.69   2.03  0.01   3   0.59      -0.45      2    3    5    9

Table 1: Descriptive statistics of age and household size

The standard error of the mean (SEM), on the other hand, measures how far the sample mean is likely to be from the true population mean. SEM is inversely proportional to the square root of the sample size (SEM = σ/√n). As the sample size increases, the sample mean becomes more representative of the population, and the SEM shrinks towards zero.
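For concreteness, the SEM values in Table 1 can be computed directly (a sketch, assuming the same hypothetical column names as above):

# SEM = sd / sqrt(n); with n > 20,000 observations it shrinks towards zero
sem <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
sem(training$age)              # ~0.10, as reported in Table 1
sem(training$household_size)   # ~0.01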

From Table 1, we can see that the SEM is very close to zero, indicating that the sample mean is likely to be close to the true population mean. In interpreting any of the statistics above, the primary assumption is that all observations in the sample are statistically independent.

Normal Q-Q Plot for Age

Normal Q-Q Plot for Household Size
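The Q-Q plots above could be generated along these lines (a sketch; the report does not show the plotting code, and the column names are again assumptions):

# Compare each variable's empirical quantiles against the theoretical normal ones
par(mfrow = c(1, 2))
qqnorm(training$age, main = "Normal Q-Q Plot for Age")
qqline(training$age)
qqnorm(training$household_size, main = "Normal Q-Q Plot for Household Size")
qqline(training$household_size)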

Exploratory


  • Fig I: The mean age of individuals who have and who do not have a bank account is 38 and 39 years respectively.

  • Fig II: Almost 90% of all females do not have a bank account


  • 72% of all male-headed households possess a bank account.
  • Where the female respondent is a spouse, 9 in 10 own a bank account.

  • 57% of male-headed households do not have a bank account, compared to 43% of their female counterparts.

  • 56% of females who own a bank account have no formal education, compared to 43.8% of their male counterparts.

  • 71% of female respondents who do not own a bank account have no formal education, in contrast with 29% of male respondents.
  • Finally, regardless of whether they have a bank account, female respondents make up the largest share of those with no formal education.
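Figures like the ones above could be tabulated along these lines (a sketch; the column name gender_of_respondent is an assumption, not taken from the report):

library(dplyr)

# Share of respondents with / without a bank account, broken down by gender
training %>%
  count(gender_of_respondent, bank_account) %>%
  group_by(gender_of_respondent) %>%
  mutate(pct = round(100 * n / sum(n), 1))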

Machine Learning

Random Forest Classifier

# Load the required libraries
library(caret)          # createDataPartition(), confusionMatrix()
library(randomForest)   # randomForest(), tuneRF(), partialPlot()

# Split the data into training and validation sets
set.seed(3456)
splitting.index <- createDataPartition(training$BankAccount, p = .68,
                                       list = FALSE,
                                       times = 1)

# training and validation set
training.data <- training[splitting.index,]
validation.data <- training[-splitting.index,]


# The classifier
model <- randomForest(formula = BankAccount ~ ., data = training.data)
model
## 
## Call:
##  randomForest(formula = BankAccount ~ ., data = training.data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 11.67%
## Confusion matrix:
##        No Yes class.error
## No  12964 354  0.02658057
## Yes  1458 748  0.66092475

From the model:

  • Three variables are randomly selected at each split (mtry = 3).
  • The out-of-bag (OOB) error, the internal validation estimate used by the model, is about 11.7%, so the estimated accuracy is roughly 88%.
  • Finally, below is a plot showing variable importance. Job type has the most predictive power, closely followed by education level and age.
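The importance plot referred to above can be produced directly from the fitted model (a sketch using the randomForest package's built-in helpers):

# Mean decrease in Gini impurity for each predictor
importance(model)

# Dot chart of variable importance, most important predictor at the top
varImpPlot(model, main = "Variable importance")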

Partial Dependence (PD) Plots

Partial dependence plots (PDP) illustrate the relationship between an input variable and the response variable. They show how the predictions partially depend on the values of that input, and they are useful for inferring relationships between the inputs and the response in complex non-parametric models.

The relationship can be causal, (non)linear, monotonic, curvilinear, a step function, and so on. In our model we illustrate partial dependence using the input with the most predictive power (job type) against the response variable (presence of a bank account). Below are the PD plots for the three most important variables against the response.

PD plot of job title

PD plot of education level

PD plot of age
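A sketch of how these plots could be drawn with randomForest::partialPlot (the variable names job_type, education_level and age are assumptions about the column names):

# Partial dependence of the predictions on each input (logit scale for
# classification); which.class = "Yes" focuses on owning a bank account
par(mfrow = c(1, 3))
partialPlot(model, pred.data = training.data, x.var = "job_type",
            which.class = "Yes", main = "PD plot of job type")
partialPlot(model, pred.data = training.data, x.var = "education_level",
            which.class = "Yes", main = "PD plot of education level")
partialPlot(model, pred.data = training.data, x.var = "age",
            which.class = "Yes", main = "PD plot of age")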

Among the nominal inputs, the categories “no formal employment” (job type) and “primary education” (education level) have the strongest effect on the response variable. Age, being numeric, shows a clear trend: as age decreases, so does the likelihood of having a bank account.

Tuning the classifier

Hyperparameter tuning is usually treated as an optimization problem. Even though random forest works well out of the box, there are three important parameters that affect the accuracy of the model: ntree, mtry and sampsize. ntree is the number of trees in the forest and defaults to 500; mtry denotes the number of predictor variables randomly selected at each split; and sampsize controls the size of each bootstrap sample. The commonly quoted figure of 63.2% is the expected proportion of unique training observations in a bootstrap sample drawn with replacement (see the short check after the list below). Other hyperparameters that affect the performance of the model include:

  • nodesize - when nodesize is small, deeper, more complex trees are grown.
  • maxnodes - caps the number of terminal nodes a tree can have, limiting tree growth to avoid overfitting.
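The 63.2% figure comes from the bootstrap: the probability that a given observation appears at least once in a sample of size n drawn with replacement is 1 − (1 − 1/n)^n, which approaches 1 − e⁻¹ ≈ 0.632 as n grows. A quick numerical check (a sketch, using the training split defined above):

# Probability that a given observation appears at least once in a bootstrap
# sample of size n drawn with replacement: 1 - (1 - 1/n)^n -> 1 - exp(-1)
n <- nrow(training.data)
1 - (1 - 1/n)^n   # ~0.632 for any reasonably large n
1 - exp(-1)       # the limiting value, ~0.6321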
# Tuning mtry using out of bag (OOB) error
tune_mtry <- tuneRF(training.data[,1:11], 
                    training.data$BankAccount, 
                    ntreeTry = 500, 
                    stepFactor = 2, 
                    plot = F, 
                    trace = T, 
                    doBest = T,
                    improve = 0.01)
## mtry = 3  OOB error = 11.64% 
## Searching left ...
## mtry = 2     OOB error = 11.5% 
## 0.01162147 0.01 
## mtry = 1     OOB error = 12.76% 
## -0.1091825 0.01 
## Searching right ...
## mtry = 6     OOB error = 12.5% 
## -0.08678611 0.01
tune_mtry
## 
## Call:
##  randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1]) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 11.51%
## Confusion matrix:
##        No Yes class.error
## No  13114 204  0.01531762
## Yes  1583 623  0.71758840
  • mtry = 2 gives the lowest OOB error, though the difference from mtry = 3 is very marginal.

  • Pruning the trees in a random forest is effectively taken care of: the model builds a large collection of de-correlated trees, and the random selection of predictors at each split keeps the trees de-correlated.

After hyperparameter tuning, the models with mtry = 3 and mtry = 2 both give an out-of-bag error of roughly 11.5–11.7%.

Evaluating the Random Forest Classifier

Confusion Matrix

# Predict using validation set
predict_BA <- predict(object = model, 
                      newdata = validation.data, 
                      type = "class")

# Create a confusion matrix 
cM <- confusionMatrix(data = predict_BA,
                      reference = validation.data$BankAccount)
cM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  6122  677
##        Yes  144  361
##                                           
##                Accuracy : 0.8876          
##                  95% CI : (0.8801, 0.8948)
##     No Information Rate : 0.8579          
##     P-Value [Acc > NIR] : 3.617e-14       
##                                           
##                   Kappa : 0.4133          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9770          
##             Specificity : 0.3478          
##          Pos Pred Value : 0.9004          
##          Neg Pred Value : 0.7149          
##              Prevalence : 0.8579          
##          Detection Rate : 0.8382          
##    Detection Prevalence : 0.9309          
##       Balanced Accuracy : 0.6624          
##                                           
##        'Positive' Class : No              
## 

From the matrix, the model correctly predicted 6,122 true positives (“No” is the positive class) and 361 true negatives.

  • Type I errors (false positives), where the model predicted that individuals do not have a bank account but they actually do, numbered 677.
  • Type II errors (false negatives), where the model predicted that individuals have a bank account but they actually do not, numbered 144.
  • The low specificity (0.3478) on the minority “Yes” class reflects the class imbalance highlighted at the start of the project.
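The headline metrics can also be pulled straight out of the caret confusionMatrix object (a short sketch):

# Overall accuracy and Kappa
cM$overall[c("Accuracy", "Kappa")]

# Class-wise metrics; note the low specificity for the minority "Yes" class
cM$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]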
 

Project done by Molo Muli

Contact me