Handling Imbalanced Data — The Right Way

Avi Patel
3 min read · Jul 12, 2021

Why is Imbalanced Data a Problem?

Imbalanced data is only a problem depending on your application. Suppose, for example, your dataset has 1,000 rows with two classes, 0 and 1, of which 900 rows are class 0 and 100 rows are class 1. If you train a model to predict the class, the algorithm will probably say 0 almost every time, and in terms of raw accuracy that looks fine: always predicting 0 already scores 90% accuracy, and your method is unlikely to do much better than that on such data.

However, in many applications we are not interested only in overall prediction accuracy but also in actually catching the 1s when they do occur.

How To Solve Data Imbalance Problem?

To overcome the imbalance in the dataset, we often use SMOTE to upsample the minority class, or a random undersampler to downsample the majority class.
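
As a rough illustration (not the article's original code), here is what both options can look like with the imbalanced-learn library; the synthetic dataset below is only a stand-in for real data:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=42)

# Option 1: upsample the minority class with synthetic examples (SMOTE)
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)

# Option 2: downsample the majority class until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
```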

So what’s the problem?

The most common practice is to apply the sampling technique to the whole training dataset before cross-validation. Because the resampled data then leaks into the validation folds, the scores it produces are misleading, and most of the time the model ends up overfitting the resampled data.

SOLUTION —

For this article, we will be going through the following steps:

  1. Baseline
  2. Sampling the wrong way
  3. Sampling The Right Way

We will be using the Hospital Readmission Prediction dataset from Kaggle.

1. Baseline (No Sampling)

And the results are very obvious. Because the data suffers from a severe imbalance, the baseline model with cross-validation and without any kind of sampling was able to recall only 0.0005% of the actual 1s in the training set and 0.0008% in the test set, which is almost the same as not catching any 1s at all.
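
The article does not include the baseline code, but a minimal sketch, assuming logistic regression as the classifier and a synthetic imbalanced dataset standing in for the Kaggle data, looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the readmission data: ~90% class 0, ~10% class 1
X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline: no resampling at all, scored on recall for the positive class
baseline = LogisticRegression(max_iter=1_000)
cv_recall = cross_val_score(baseline, X_train, y_train, cv=5, scoring="recall")
print("Baseline CV recall:", cv_recall.mean())
```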

2. Sampling the Wrong Way (Sampling the whole training dataset)

We will perform undersampling for this particular dataset.
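
A sketch of this wrong approach, reusing the train/test split from the baseline sketch above: the whole training set is undersampled first, and cross-validation then runs on the already-balanced data, so every validation fold is balanced as well.

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score

# The wrong way: undersample the ENTIRE training set up front...
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# ...then cross-validate on the already-resampled data. The validation folds
# are balanced too, so the CV recall looks better than the untouched test set.
model = LogisticRegression(max_iter=1_000)
cv_recall = cross_val_score(model, X_res, y_res, cv=5, scoring="recall")
print("Wrong-way CV recall:", cv_recall.mean())

model.fit(X_res, y_res)
print("Test recall:", recall_score(y_test, model.predict(X_test)))
```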

After downsampling the majority class to the size of the minority class, we were able to recall 74% of the actual 1s from the training dataset but only 63% from the test dataset. The results look good, but the gap of roughly 10 percentage points between training and test shows that the cross-validated score is overly optimistic.

3. Sampling the Right Way (Sampling training data using CV)
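
A sketch of the right approach, again reusing the baseline split: putting the sampler inside an imbalanced-learn Pipeline means that, within each cross-validation split, only the training folds are resampled while the validation fold keeps its original class distribution.

```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score

# The right way: the sampler lives inside the pipeline, so resampling is
# re-fitted on the training folds only within every cross-validation split.
pipe = Pipeline([
    ("sampler", RandomUnderSampler(random_state=42)),
    ("model", LogisticRegression(max_iter=1_000)),
])

cv_recall = cross_val_score(pipe, X_train, y_train, cv=5, scoring="recall")
print("Right-way CV recall:", cv_recall.mean())

# Final check on the held-out test set after fitting on the full training data
pipe.fit(X_train, y_train)
print("Test recall:", recall_score(y_test, pipe.predict(X_test)))
```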

Now both the train and the test sets have roughly the same recall score, so we can say that our model performs consistently and returns the same score on both the seen and the unseen data.

Conclusion —

In your own problems, you should build both the baseline model and the (correctly) downsampled models, and use the cross-validation scores for your modeling decisions. The test set's role is to tell you how well your model generalizes after all of your modeling decisions have been made.
