Techniques and Methods of Handling Unbalanced Data
Author: Leonel Silima
Date of Publication: 02/02/2023
Currently, financial management software and platforms are built on artificial intelligence algorithms. These solutions start from raw data that is pre-processed and then fed to machine learning algorithms. The question, then, is: what problems arise in these applications, and what motivates the techniques discussed here?
What is the need to handle unbalanced data?
One of the biggest problems with machine learning platforms and applications has been poor performance, often the result of not applying appropriate techniques to unbalanced data. Such applications tend to predict the majority class well while failing on the minority class, so their reliability runs in only one direction. Moreover, the quality of a machine learning model starts with the quality of the data in the pre-processing stage; when this step is neglected, there is a risk of producing misleading models that forecast poorly. Below, we discuss the main techniques and methods for handling unbalanced data before the training stage with machine learning algorithms.
Unbalanced data is characterized by the small incidence of one category within a dataset (the minority class) compared to the other categories (the majority classes). The following image illustrates the characteristics of unbalanced data.
However, if we develop a machine learning model without accounting for the disproportion between the majority and minority instances, the model falls victim to the accuracy paradox: its parameters fail to differentiate the minority class from the other categories, producing good results in testing but poor results when applied in real situations. Below we present a predictive model of bank loan payments based on unbalanced data.
As we see in this example, the "not pay loan" category is dominant, with 1,400 cases against only 200 cases of "pay loan". A model that simply predicts "not pay loan" for every customer is therefore correct for the 1,400 non-payers and wrong for all 200 payers: payers are only 200 of the 1,600 customers (12.5%), so the naive model looks accurate while never identifying a paying customer. Precision and recall are better measures in such cases. The underlying issue is the class imbalance between the positive and negative classes, so the prior probabilities of these classes must be accounted for in error analysis. Precision and recall help, but precision can still be biased by very unbalanced class priors in the test set.
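To make the accuracy paradox concrete, here is a minimal sketch using the 1,400/200 split from the example above; the always-majority "model" is a hypothetical baseline, not an actual trained classifier:

```python
import numpy as np

# Labels from the loan example: 0 = "not pay loan" (1,400), 1 = "pay loan" (200)
y_true = np.array([0] * 1400 + [1] * 200)

# A naive model that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()            # 1400 / 1600
recall_pay = (y_pred[y_true == 1] == 1).mean()  # fraction of payers caught

print(f"accuracy: {accuracy:.3f}")               # 0.875, looks good on paper
print(f"recall ('pay loan'): {recall_pay:.3f}")  # 0.000, useless for payers
```

High accuracy here says nothing about the minority class, which is exactly why recall (and precision) must be inspected per class.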
Techniques and Methods of Handling Unbalanced Data
We can subdivide unbalanced data treatment into two large groups: OverSampling (increasing the minority instances) and UnderSampling (reducing the majority instances).
1. UnderSampling techniques
UnderSampling is a method that consists of reducing the number of observations of the most frequent class in order to reduce the difference between categories. It includes the following techniques:
Random UnderSampling
This technique consists of randomly removing data from the more frequent class, which can lead to a serious loss of information.
NearMiss – 3
This technique adds heuristic rules based on the k-nearest neighbors (kNN) algorithm, as implemented in imbalanced-learn (which builds on scikit-learn). In particular, NearMiss-3 is a 2-step algorithm: first, for each negative sample, its M nearest neighbors are kept; then, the positive samples selected are the ones whose average distance to the N nearest neighbors is the largest, as we can see in the figure below.
In addition, we should note that, unlike random removal, this technique limits the loss of information, which makes it the most recommended of the UnderSampling approaches discussed here.
2. OverSampling techniques
Contrary to UnderSampling, OverSampling consists of synthetically creating observations of the minority class, aiming to match the proportions of the categories. This technique offers good results when the minority class doesn't show very large variation in its features; otherwise, the model may become very good at identifying the specific cases it was trained on rather than the category as a whole. Among the methods, the following apply:
SMOTE (Synthetic Minority Oversampling Technique) - Oversampling
SMOTE is one of the most widely used oversampling methods for the imbalance problem. Specifically, it balances the class distribution by generating new synthetic examples of the minority class, interpolating between existing minority samples and their nearest neighbors rather than simply replicating them. However, this technique has known weaknesses (for example, it can amplify noisy minority points), which motivate refinements such as the following variant.
SMOTE + ENN
SMOTE + ENN consists of a combination of a synthetic data generation technique and a noisy-data cleaning technique (Edited Nearest Neighbours) to avoid class overlap. It's important to note that this technique first generates synthetic data for the minority class and then cleans misclassified instances from both classes, making the data more consistent for machine learning algorithms.
Next, we observe the characteristics of the data before and after applying the SMOTE technique.
Finally, there are many other techniques for treating unbalanced data, both UnderSampling and OverSampling, as well as other variants of the SMOTE method. The practices mentioned in this article, however, are among the most applied and have shown the best results. It is also worth emphasizing that, when handling unbalanced data, we should evaluate all of the above alternatives and keep the method that presents the best results in its context. In addition, note that more than one independent type of transformation should not be applied at the same time, at the risk of losing the original information!
In the next article we will talk about processing financial data with machine learning algorithms.
Dye, S. (May 2, 2020). How to Handle SMOTE Data in Imbalanced Classification Problems. Towards Data Science.
imbalanced-learn (December 27, 2022). Retrieved from imbalanced-learn.org: https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.CondensedNearestNeighbour.html#imblearn.under_sampling.CondensedNearestNeighbour
scikit-learn (December 30, 2022). Retrieved from scikit-learn.org: https://scikit-learn.org/stable/modules/tree.html