What are Imbalanced datasets?
A dataset that exhibits an unequal distribution between its classes is considered imbalanced. Such datasets are typically found in areas like predictive maintenance, sales propensity, and fraud identification.
For example, in a predictive maintenance scenario, a dataset with 20,000 observations is labelled with the classes Failure and Non-Failure. After analysing the data, it was found that only 1% of the observations (about 200) belong to the Failure class and the remaining 99% (about 19,800) are Non-Failure. That is an imbalanced proportion of classes.
We recently came across an imbalanced dataset with 74 features, in which a few of the categorical variables have almost 150 levels and contain missing values. So this is a high-dimensional imbalanced dataset problem, where most machine learning techniques fail to predict the minority class (they overfit and are biased towards the majority class).
Performance of general ML algorithms on Imbalanced datasets:
To understand the complexity of the problem, we show below the performance of one of the more powerful ML algorithms, RandomForestClassifier, trained on the dataset under discussion. In this dataset, login is the minority class (1%) and no-login is the majority class (99%). The total number of records is around 1.24 lakh (about 124,000).
As we can clearly observe, the ability of the model to predict login (the minority class in our case) is zero (recall = 0.00), although the accuracy is around 99%.
This means that the algorithm fails completely to differentiate the two classes (majority and minority), because of the class imbalance and the high dimensionality of the dataset, which includes categorical variables with many levels.
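To see why 99% accuracy can coexist with zero recall, here is a minimal sketch in plain Python. The labels are made up to mirror the roughly 1% minority share described above, and the "model" is an assumption for illustration: it simply predicts the majority class every time. It is not the RandomForestClassifier from our experiment.

```python
# Hypothetical labels: 10 minority ("login") and 990 majority ("no-login") records.
y_true = ["login"] * 10 + ["no-login"] * 990

# A degenerate "model" that always predicts the majority class.
y_pred = ["no-login"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all observations predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of the actual `positive` records that the model recovered."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred, positive="login")
print(f"accuracy = {acc:.2f}")          # accuracy = 0.99
print(f"minority recall = {rec:.2f}")   # minority recall = 0.00
```

Accuracy rewards the 990 correct majority predictions, while recall on login looks only at the 10 minority records, none of which is found. This is why recall of the minority class is the metric we track in the rest of this article.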
Below are the ways in which we handled this problem and trained a model that correctly predicted about 90% of the minority class:
1. Reduce number of variables:
A larger number of variables tends to lead to overfitting. To overcome this, drop irrelevant variables (subject matter expertise is required) and also variables whose share of missing values exceeds a chosen threshold (15% in our case). We also calculated importance scores for the variables and dropped those with the lowest scores.
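The missing-value filter can be sketched with pandas as follows. This is only an illustration under assumptions: the helper name drop_sparse_columns, the toy frame and its column names are hypothetical, and only the 15% threshold comes from our case.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.15) -> pd.DataFrame:
    """Return df without the columns whose share of missing values exceeds max_missing."""
    missing_share = df.isna().mean()  # per-column fraction of missing values
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]


# Toy frame with hypothetical column names.
toy = pd.DataFrame({
    "age": [25, 31, np.nan, 40, 52, 38, 29, 47, 33, 45],              # 10% missing
    "region": ["N", None, None, "S", None, "E", "W", "N", "S", "E"],  # 30% missing
    "visits": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],                         # 0% missing
})

reduced = drop_sparse_columns(toy)
print(list(reduced.columns))  # ['age', 'visits']
```

The threshold is a judgment call: a variable with many missing values can still be informative, so it is worth checking with a subject matter expert before dropping it.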
2. Dealing with categorical variables:
Most Python ML algorithms work on numerical input, so categorical data (such as Gender or Region) must be encoded as numerical data before being fed to the algorithm. If a categorical variable has a large number of levels (e.g. the cities in a country could number in the hundreds), this leads to inefficiency (overfitting) and higher consumption of computational power. We solved this by selecting an algorithm that can handle categorical variables internally. We also ensured that the categorical variables we kept are not correlated with each other.
3. Imputing Missing Values:
We imputed numerical missing values using MICE (Multivariate Imputation by Chained Equations) and categorical variables with the most frequent value. Imputing a categorical variable with its overall most frequent value is not recommended, however, because it fills the minority class's missing values with values typical of the majority class. A better approach is to impute with the most frequent value within each class of the label.
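A minimal pandas sketch of the class-wise approach is given below. The helper name impute_mode_per_class, the toy data and the column names are hypothetical and only illustrate the idea; this is not our exact preprocessing code.

```python
import pandas as pd


def impute_mode_per_class(df, column, label):
    """Fill missing values of `column` with the most frequent value
    observed among the rows that share the same `label` value."""
    def fill_group(values):
        modes = values.mode()  # mode() ignores missing values
        if modes.empty:        # every value in the group is missing: leave as-is
            return values
        return values.fillna(modes.iloc[0])

    out = df.copy()
    out[column] = out.groupby(label)[column].transform(fill_group)
    return out


toy = pd.DataFrame({
    "target": ["login", "login", "login",
               "no-login", "no-login", "no-login", "no-login"],
    "device": ["mobile", "mobile", None,
               "desktop", "desktop", None, "mobile"],
})

filled = impute_mode_per_class(toy, "device", "target")
print(filled["device"].tolist())
# ['mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'mobile']
```

A global mode would have filled both gaps with the same value, regardless of class. Note that this kind of imputation needs the label, so it only applies to training data; for records that are still to be scored, a different rule is needed.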
4. Improving recall of minor class:
As we saw initially, general-purpose algorithms trained on this dataset overfit to the majority class, since it holds the lion's share of the data. So, as described in the rest of this article, we selected a very strong machine learning algorithm called CatBoost, which is robust to overfitting.
As we generally favour boosting-based machine learning algorithms in our data science problems, we chose CatBoost, which also handles categorical variables internally.
CatBoost helps avoid overfitting with its overfitting detector, which leads to more generalized models. It is based on its own algorithm for constructing models, which differs from the standard gradient-boosting scheme.
Initial test results of CatBoost after applying it to the processed dataset:
The initial results of CatBoost with the default hyper-parameters are quite convincing, giving a recall of 0.47; that is, the model correctly identifies 47% of the logins (0s), compared with 0% for the general-purpose algorithm trained on all the variables.
The Final test results:
Further tuning of CatBoost's hyper-parameters gave us the results below:
As is evident, we managed to boost the recall, i.e. the share of logins (0s) that the model correctly identifies, from 47% to 89%.
Thanks to CatBoost, the data processing procedure, and a systematic thought process for the data science solution, we were able to achieve this fantastic result.
Note: As the result is impressive, we did not apply resampling techniques (oversampling) to this dataset.