Overcome Imbalance in your Datasets — PART I
by Rafael Zoghbi, Cloud Engineer at EDRANS
1. Intro
Imbalance in datasets imposes a heavy penalty on the accurate prediction of minority classes. In practice, this means our model will be almost unable to predict the cases that matter the most: those where the cost of misclassification is greatest. This article presents several techniques to solve this problem and demonstrates them in a working environment with a real dataset.
2. Imbalanced Datasets: Why it matters
A dataset is imbalanced when its classes are not approximately equally distributed, that is, when there is a severe skew in the class representation. So, how severe can this skew be? Studies report imbalances on the order of 100 to 1 in fraud detection, and up to 100,000 to 1 in other applications. This kind of use case, where there are very few samples of one class relative to the others, can be seen as a needle-in-a-haystack problem.
Let’s dive a little further into this. Imagine we have a dataset where classes appear in a 999:1 ratio. The algorithm is clever in its own way: having mostly seen one type of case, the classifier will try to predict every example as if it belongs to the majority class. By doing so, it would score 99.9% accuracy, a tough benchmark to beat. However, no matter how high our accuracy is, this approach has bigger problems to address.
First of all, it assumes equal error costs. This means that the error of misclassifying an observation has the same consequences regardless of the class.
In the real world, things are different. Classification often leads to action, and actions have consequences. In cases where predicting the minority class matters most, a random-guess approach is simply not tolerable, because misclassifying the minority class may mean allowing a fraudulent transaction, ignoring a malfunctioning part, or failing to detect a disease. For example, a typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels. The nature of this kind of application demands a very high detection rate in the minority class and tolerates only a small error rate in the majority class.
The consequences of misclassification may be severe, and performing the wrong action may be quite costly. Only on very rare occasions are the costs of both kinds of mistake equivalent. In fact, it is hard to think of a domain where classification is indifferent to whether it makes Type 1 (false positive) or Type 2 (false negative) errors.
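To make the 999:1 example above concrete, here is a quick sanity check using scikit-learn's DummyClassifier as a majority-class baseline; the data is synthetic and purely illustrative.

```python
# Illustrative only: a majority-class baseline on a synthetic 999:1 dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 999 + [1])        # 999 negatives, 1 positive
X = np.zeros((1000, 1))              # features are irrelevant for this baseline

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))       # ~0.999: impressive-looking accuracy
print(recall_score(y, pred))         # 0.0: the minority class is never detected
```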
3. Addressing the Imbalance
Now that we understand how the bias in the dataset can influence machine learning algorithms, let’s talk a little bit about how the machine learning community has addressed the issue of class imbalance. There are roughly two ways to deal with this problem: one is to assign distinct costs to training examples, and the other is to resample the original dataset, either by oversampling the minority class and/or undersampling the majority class.
In this article, we will focus the discussion on the resampling of the original dataset and will leave the adjustment of training costs for future discussions.
As stated before, we can balance our dataset by increasing the minority class until it matches the majority class, and/or by cutting down the majority class to match the less represented class. Both techniques have their advantages and disadvantages, especially since there are multiple ways to oversample the minority class.
One way to oversample the minority class is to randomly select samples from it and add them to the training dataset until the desired class distribution is achieved. Essentially, this means randomly duplicating observations. This approach is also referred to as naive resampling because it assumes nothing about the data. Its major advantage is that it is simple to implement and fast to execute; however, in line with previous research, it tends to have little impact on model performance.
According to previous research, undersampling the majority class is another way to address the imbalance in datasets. It enables better classifiers to be built when compared to random oversampling of the minority class.
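For reference, here is a minimal sketch of both naive approaches using imbalanced-learn's RandomOverSampler and RandomUnderSampler; the toy dataset and parameters are illustrative, not from the article's repo.

```python
# Naive resampling sketch on a synthetic imbalanced dataset (illustrative only).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy dataset with roughly 1% positive cases.
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=42)
print("Original:", Counter(y))

# Random oversampling: duplicate minority samples at random until classes are balanced.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("Oversampled:", Counter(y_over))

# Random undersampling: drop majority samples at random until classes are balanced.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("Undersampled:", Counter(y_under))
```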
Another alternative is to oversample the minority class by creating “synthetic” examples, rather than randomly oversampling with replacement. Enter the Synthetic Minority Oversampling Technique, or ‘SMOTE’.
4. Synthetic Minority Oversampling Technique
4.1. Introduction
The Synthetic Minority Oversampling Technique, or SMOTE, is an oversampling technique that generates new synthetic training samples by performing operations on real samples. SMOTE increases the minority class by taking each minority class sample and inserting synthetic examples along the line segments joining it to any or all of its K nearest minority class neighbors. Depending on the amount of oversampling required, neighbors from the K nearest neighbors are randomly chosen. Let’s slow down a bit on this part and elaborate on how SMOTE works.
Figure 1 shows a two-dimensional representation of a heavily imbalanced dataset. As we can see, the green class is underrepresented relative to the blue class. We will show how SMOTE would work its oversampling magic. To do so, let’s first zoom in a little on the green class to get a better view.
Alright, so we now have a zoomed-in picture of a portion of the green, minority class. SMOTE will first select a minority class instance at random and find its K nearest minority class neighbors. Then it will choose one of those K neighbors at random and connect the two points to form a line segment in the feature space. The synthetic instance is created at a random point along that segment.
This procedure will be repeated enough times with different data points until the minority class has about the same proportion as the majority class.
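To make the interpolation step concrete, here is a small sketch of how one synthetic point can be generated between a minority sample and one of its K nearest minority neighbors; it mimics the idea described above and is not the imbalanced-learn internals.

```python
# Illustrative sketch of SMOTE-style interpolation between minority samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 2))   # pretend minority-class samples with 2 features

k = 5
# +1 neighbor because each point is returned as its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)

def make_synthetic(i: int) -> np.ndarray:
    """Create one synthetic sample from minority point i."""
    _, idx = nn.kneighbors(X_minority[i].reshape(1, -1))
    neighbor = X_minority[rng.choice(idx[0][1:])]   # pick one of the K neighbors at random
    lam = rng.random()                              # random position along the segment
    return X_minority[i] + lam * (neighbor - X_minority[i])

synthetic = np.array([make_synthetic(i) for i in range(len(X_minority))])
print(synthetic.shape)   # (20, 2): one new synthetic sample per original minority point
```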
From our previous explanation, we highlight two main ideas. First, in its simplest form, SMOTE only works with continuous features. For categorical features, this way of generating synthetic samples would produce values on a continuous spectrum, which breaks the coherence of our dataset.
Second, because of the K nearest neighbors threshold, SMOTE will create synthetic observations where there is a high density of samples, and fewer synthetic samples near the inter-class boundaries. If misclassification often happens near the decision boundary, this approach will not create samples that reflect the reality of our use case and will not help improve the classification score.
As you can see, depending on the characteristics of our dataset and our use case, the default behavior of SMOTE may not be the best fit for overcoming the imbalance. For this reason, different variants of SMOTE have been developed for cases where, in its simplest form, the Synthetic Minority Oversampling Technique falls short.
4.2. Techniques
Apart from the default SMOTE behavior described in the previous paragraphs, there are other oversampling techniques based on SMOTE that may improve our oversampling outcome. Let’s dive a little deeper into those.
4.2.2. SMOTE for Nominal and Continuous (SMOTE-NC and SMOTEN)
In its default form, SMOTE works only with continuous features. When we are working with categorical data, SMOTE will create samples in a continuous space, and we would end up with data that does not make any sense. For example, imagine we are trying to oversample a categorical feature such as “CustomerType” that has 4 possible values, which we have encoded as 1 to 4. If we oversampled this data with SMOTE, we could end up with samples such as 1.34 or 2.5, which would be meaningless and would distort our model. The premise behind SMOTE-NC is simple: we specify which features are categorical, and for those, it will pick the most frequent category among the nearest neighbors used during generation. SMOTE-NC expects a mix of categorical and continuous features in our dataset; if we have an all-categorical dataset, then SMOTEN is our choice. SMOTEN follows the same premise as SMOTE-NC for creating synthetic observations, but expects an all-categorical dataset.
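A minimal sketch of both samplers from imbalanced-learn follows; the toy data, column layout, and parameters are illustrative assumptions, and SMOTEN requires a reasonably recent release of the library.

```python
# Illustrative sketch of SMOTE-NC (mixed features) and SMOTEN (all-categorical features).
import numpy as np
from imblearn.over_sampling import SMOTENC, SMOTEN

rng = np.random.default_rng(0)
n = 200
y = np.array([0] * 180 + [1] * 20)   # imbalanced toy labels

# Mixed dataset: column 0 is continuous (e.g. age), column 1 is categorical (encoded 1-4).
X_mixed = np.column_stack([rng.normal(40, 10, n), rng.integers(1, 5, n)])

# SMOTE-NC: tell the sampler which column indices are categorical.
X_nc, y_nc = SMOTENC(categorical_features=[1], random_state=42).fit_resample(X_mixed, y)

# SMOTEN: for datasets where every feature is categorical.
X_cat = rng.integers(1, 5, size=(n, 3))
X_n, y_n = SMOTEN(random_state=42).fit_resample(X_cat, y)
```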
4.2.3. SVM SMOTE
Support Vector Machine SMOTE (SVM-SMOTE) is an oversampling method that focuses on the minority class instances lying around the borderline between classes. Because this area is the most crucial for establishing the decision boundary, new instances are generated in such a way that the minority class area expands toward the side of the majority class, in the places where few majority class instances appear.
The objective of the Support Vector Machine is to find the optimal hyperplane that separates the positive and negative classes with a maximum margin. The borderline area is approximated by the support vectors obtained after training a standard SVM classifier on the original dataset, and new instances are randomly created along the lines joining each minority class support vector with a number of its nearest neighbors, using interpolation or extrapolation depending on the density of majority class instances around it.
Whether to choose SVM SMOTE over regular SMOTE depends entirely on the prediction model’s targets and the business affected by it. If misclassification happens near the decision boundary, then SVM SMOTE may be better suited for our oversampling process.
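Here is a minimal sketch using imbalanced-learn's SVMSMOTE; the synthetic dataset and the neighbor counts shown are illustrative defaults, not tuned values.

```python
# Illustrative sketch of SVM-SMOTE oversampling on a synthetic imbalanced dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=5_000, weights=[0.98], random_state=42)

# SVM-SMOTE fits an SVM first and generates new minority samples near the estimated borderline.
sampler = SVMSMOTE(k_neighbors=5, m_neighbors=10, random_state=42)
X_res, y_res = sampler.fit_resample(X, y)

print(Counter(y), "->", Counter(y_res))
```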
4.2.4. K-Means SMOTE
K-Means SMOTE uses a combination of K-Means clustering and SMOTE oversampling to overcome some of default SMOTE’s shortcomings. The use of clustering enables the oversampler to identify target areas of the input space where the generation of artificial data is most effective. The method aims at eliminating both between-class and within-class imbalances while avoiding the generation of noisy samples. K-Means SMOTE involves three steps: clustering, filtering, and oversampling. In the clustering step, the input space is aggregated into K groups using K-Means clustering, an iterative method for finding naturally occurring groups in data that can be represented in a Euclidean space. The most notable hyperparameter of the K-Means algorithm is k itself, the number of clusters. Finding an appropriate value for k is essential for the effectiveness of K-Means SMOTE, as it determines how many minority clusters can be found in the filtering step.
The next step is to filter and choose the clusters to be oversampled, and to determine how many samples are to be generated in each one. The idea is to oversample clusters dominated by the minority class, as applying SMOTE inside minority areas is less susceptible to noise generation. The selection of clusters is based on each cluster’s proportion of minority and majority instances, and this imbalance-ratio threshold is itself an adjustable hyperparameter. Finally, in the oversampling step, each filtered cluster is oversampled using SMOTE. As you can see, the method relies on an unsupervised approach that enables the discovery of overlapping class regions and may help avoid oversampling in unsafe areas.
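A minimal sketch with imbalanced-learn's KMeansSMOTE follows; the clustering estimator, number of clusters, and balance threshold are illustrative and will usually need tuning for a real dataset.

```python
# Illustrative sketch of K-Means SMOTE: cluster, filter, then oversample inside minority clusters.
from collections import Counter

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=42)

sampler = KMeansSMOTE(
    kmeans_estimator=MiniBatchKMeans(n_clusters=50, random_state=42),  # the "k" of the clustering step
    cluster_balance_threshold=0.1,  # filtering step: keep clusters with enough minority samples
    random_state=42,
)
# Note: if no cluster passes the filter, the sampler raises an error; adjust k or the threshold.
X_res, y_res = sampler.fit_resample(X, y)

print(Counter(y), "->", Counter(y_res))
```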
Of course, when we talk about resampling with SMOTE and its variants, there is a lot going on under the hood, and explaining in detail the procedure and the math behind each technique is out of the scope of this article. The techniques described above are the result of intensive research by the community and, even though their theory is complex, we don’t need to know the exact mathematical procedure happening behind the curtains, because they are available as libraries and methods we can use, and that is exactly what we are going to do.
Now that we understand why we need SMOTE and how it works, it’s time to get our hands dirty and show how to implement it using Imbalanced-Learn, an open-source library that builds on Scikit-learn to provide tools for dealing with imbalanced datasets.
5. Implementing SMOTE
In this section, we will demonstrate the working features of SMOTE and its different variants. The idea is to work with a highly imbalanced dataset where we can try different oversampling techniques discussed in the article and see how they perform.
5.1 Describing the dataset
The dataset we will be working with is a Stroke Prediction Dataset, used to predict whether a patient is likely to suffer a stroke based on various parameters (link to the dataset: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset).
As with every machine learning workflow, we should perform some EDA. So, let’s take a look at the dataset and, of course, zoom in on the imbalance.
As you can see, there are features describing some characteristics of the patients. We can also check our label, stroke, which will most likely show an important skew. From this initial review, we can also verify that there are null or empty values, string-type variables, and irrelevant columns. We will take care of those so we can focus on overcoming the imbalance.
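A minimal preprocessing sketch is shown below; the file name and the exact cleaning steps are assumptions based on the public Kaggle dataset, so check the repo shared with the article for the authoritative version.

```python
# Assumed preprocessing sketch for the Kaggle stroke dataset (verify names against your copy).
import pandas as pd

df = pd.read_csv("healthcare-dataset-stroke-data.csv")   # assumed local file name

df = df.drop(columns=["id"])                             # irrelevant identifier column
df["bmi"] = df["bmi"].fillna(df["bmi"].median())         # fill missing values
df = pd.get_dummies(df, drop_first=True)                 # encode string/categorical features

# Quick look at the class imbalance on the label.
print(df["stroke"].value_counts(normalize=True))
```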
After the preprocessing stage, it’s easier for us to analyze the imbalance and try to think about a machine learning approach to develop a trustworthy model for preventing strokes.
Even with little knowledge about the issue we are dealing with, we can infer that only a few individuals in our population sample will present a stroke. Even though stroke is the second leading cause of death globally, from a sample perspective we will have far more negative cases than positive ones.
Roughly 2% of the whole sample has suffered a stroke. This is a case of a highly imbalanced dataset. Notice that we have less than a thousand positive cases and over 42 thousand negative cases.
We can always rely on visualization to help us understand this bias, even though in our case it’s a no-brainer.
Great, so we have the perfect dataset to demonstrate our beloved oversampling techniques.
5.2 Overcoming the imbalance
What we will do now is partition the dataset into train and test sets. From there, we will oversample the training dataset using the different techniques, train our model using the XGBoost algorithm, and assess the impact on our metrics with a confusion matrix computed on the test subsample.
One important note here, a very important one actually: always perform your oversampling on your training dataset. Never oversample the testing dataset.
Why, you may ask? Because oversampling generates samples synthetically and alters the distribution of your classes. If we do it on the test dataset, the evaluation no longer reflects the real class distribution, and the synthetic samples introduce noise that yields misleading, overly optimistic results. So, the best way to do it is to set your testing dataset aside and do your oversampling only on the training dataset.
In our case, this is how we suggest splitting and oversampling.
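A hedged sketch of the split is shown below, continuing from the preprocessing sketch above; the variable names are ours, and the repo linked with the article remains the reference.

```python
# Split first, then oversample only the training portion (df is the preprocessed DataFrame).
from sklearn.model_selection import train_test_split

X = df.drop(columns=["stroke"])
y = df["stroke"]

# Stratify so the rare positive class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```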
The oversampling phase is quite simple to implement thanks to libraries such as imbalanced-learn; this is how we implemented the default SMOTE version. For further reference, check out the repo shared along with the article for full details.
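Since the original code lives in the shared repo, here is a hedged reconstruction of the default SMOTE step described next.

```python
# Default SMOTE on the training set only; sampling_strategy=0.5 grows the minority class
# to roughly 50% of the majority class instead of fully equalizing them.
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(y_train_smote.value_counts())
```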
In the previous code block, we can see how we implement SMOTE. One thing to note here is that we are not fully equalizing the two classes: doing so would require creating over 34 thousand synthetic samples, which would definitely distort our performance. So we will oversample, but only to a certain degree, until our minority class reaches roughly 50% of the majority class.
This is what it looks like:
We did exactly the same thing with the other SMOTE techniques to create the remaining datasets. The imbalanced-learn library makes it very simple to call a different method for every oversampling technique.
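As an illustration of that last point, here is a hedged sketch of looping over a few of the variants discussed earlier, training an XGBoost model on each resampled training set, and comparing confusion matrices on the untouched test set; the sampler choices and hyperparameters are ours, not the article's repo.

```python
# Try several SMOTE variants on the training set and evaluate each on the original test set.
from imblearn.over_sampling import SMOTE, SVMSMOTE, KMeansSMOTE
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

samplers = {
    "smote": SMOTE(sampling_strategy=0.5, random_state=42),
    "svm_smote": SVMSMOTE(sampling_strategy=0.5, random_state=42),
    # KMeansSMOTE may need its clustering/threshold parameters tuned for this dataset.
    "kmeans_smote": KMeansSMOTE(sampling_strategy=0.5, random_state=42),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    model = XGBClassifier(n_estimators=200, random_state=42).fit(X_res, y_res)
    print(name)
    print(confusion_matrix(y_test, model.predict(X_test)))
```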