DATA TYRANNY: A Simplified Explanation of the Dangers of and Solutions to Data Imbalance.

All hell broke loose when the teacher of Primary 2A complained bitterly that most of her students occupied the lower chamber of intelligence when compared with their mates in Primary 2B. Daily, she needed to add extra sauce to her teachings before the wavelengths of her voice could pluck the chord of intelligence in the minds of her pupils.

This article deals with the following:

  • What is Data Imbalance?

  • Aims of Data Classification

  • Factors Affecting Imbalanced Data

  • Techniques for Balancing Data

  • Conclusion

The phenomenon idealized by the example above would later be known as DATA TYRANNY or DATA IMBALANCE. Frequently, in the process of data mining, quite a good number of data mongers tend to bury the subject of DATA IMBALANCE, as they feel the subject has little to no impact on most of the challenges they’re trying to use their data ingenuity to tackle. However, at the crux of this assumption lies a grave of fireflies: the grave is so well-lit that it becomes attractive, only to turn into nothing but an eerie yard when the fireflies lose their glory.

These words simply mean:

“your model will one day mess up big time if you don’t probe the balance of your dataset”.

For example, the Headmaster who divided the students into their respective classes in the opening paragraph evidently partitioned them purely by headcount, neglecting the innate capabilities of the students. The reward for such a careless move would be imbalanced growth among the students.

I know you’re curious to know what Imbalanced Data is:

Imbalanced Data is a dataset with an unequal class distribution.

When you have a dataset that skews towards a particular class, when your dataset does not wholly capture the reality of the problem you’re tackling, then it becomes nothing less than ice cream served hot.
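To make this concrete, here is a minimal sketch (the labels are invented for illustration) of how you might probe the class distribution of a dataset before training:

```python
from collections import Counter

# Hypothetical labels for a binary classification task (1 = minority class).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Count each class and compute the imbalance ratio (majority / minority).
counts = Counter(labels)
imbalance_ratio = counts[0] / counts[1]

print(counts)           # Counter({0: 9, 1: 1})
print(imbalance_ratio)  # 9.0
```

A ratio far from 1.0 is an early warning sign that a model trained on this data may simply favour the majority class.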

Aims of Data Classification

Before delving into how to resolve this dilemma, a brain reawakening would be great. The major aims of data classification are to:

  1. Maximize output accuracy

  2. Draw training data that is representative of the general data.

When either of these golden eggs is trampled upon, a data tyranny has already been committed. We cannot blame ourselves entirely; the data collection process often lies outside the jurisdiction of a Data Analyst or a Machine Learning Engineer.

Factors Affecting Imbalanced Data

Some of the factors that affect imbalanced data include the degree of class imbalance, the size of the training data, the type of classifier and the complexity of the concept being represented by data.

Among all these factors, the most striking and relatable is the complexity of the concept being represented by the data. For example, consider a model built to detect fraudulent activities in a bank.

Evidently, the proportion of persons complicit in fraud would be small compared to the proportion of non-fraudulent individuals.

A model trained on such data would fail to identify and recognize fraudulent activities when put to the test, because the model consumed mostly “non-fraudulent” data during its training phase; hence, the model cannot generalize.
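A toy illustration of this trap, with invented numbers: a “model” that always predicts the majority class can look highly accurate while being useless at catching fraud.

```python
# Simulated labels: 990 legitimate transactions (0) and 10 fraudulent ones (1).
y_true = [0] * 990 + [1] * 10

# A naive "model" that always predicts non-fraud.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)      # 0.99 -- looks impressive
print(fraud_caught)  # 0    -- yet not a single fraud is detected
```

This is why accuracy alone is a misleading metric on imbalanced data.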

Techniques for Balancing Data

Imbalanced Data isn’t an endless loop; it can be dealt with through numerous means, the most popular of which are:

  1. Undersampling

  2. Oversampling

  3. Hybrid Sampling

1. Undersampling: This is a non-heuristic technique that reduces the number of majority-class samples to match the total number of minority-class samples in a dataset. This method can give greater accuracy on the minority class; however, during the process of “downsizing”, information loss becomes inevitable, and this might lead to underfitting of our model. This is a great lesson to keep handy: even while trying to balance data, the integrity of our data should still be preserved.
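A minimal sketch of random undersampling, using made-up (feature, label) pairs: majority samples are randomly discarded until the two classes match in size.

```python
import random

random.seed(42)  # for reproducibility

# Hypothetical dataset: each item is a (feature, label) pair.
majority = [(i, 0) for i in range(100)]  # 100 majority-class samples
minority = [(i, 1) for i in range(10)]   # 10 minority-class samples

# Randomly keep only as many majority samples as there are minority samples.
undersampled_majority = random.sample(majority, len(minority))
balanced = undersampled_majority + minority

print(len(balanced))  # 20 -- ninety majority samples were discarded
```

The 90 discarded majority samples are exactly the “information loss” the paragraph above warns about.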

2. Oversampling: This is a heuristic technique that adds more minority-class samples, either by repetition or by synthetic generation. It involves upsizing the minority class to balance the majority class. This technique ensures that there is no information loss; however, overfitting might be encountered if the additional minority samples being generated do not reflect what the dataset represents.
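A minimal sketch of the repetition flavour of oversampling (synthetic generation, e.g. SMOTE, is beyond this sketch): minority samples are duplicated, with replacement, until the classes match.

```python
import random

random.seed(0)  # for reproducibility

# Hypothetical dataset: each item is a (feature, label) pair.
majority = [(i, 0) for i in range(100)]  # 100 majority-class samples
minority = [(i, 1) for i in range(10)]   # 10 minority-class samples

# Duplicate minority samples (sampling with replacement) until sizes match.
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra

print(len(balanced))  # 200 -- both classes now have 100 samples
```

Because the new minority samples are exact copies, a model can memorize them, which is where the overfitting risk comes from.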

3. Hybrid Sampling: This gives the best of both worlds, as it utilizes both oversampling and undersampling to achieve data balance: some majority-class instances are removed while some minority-class instances are created.
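A minimal sketch of the hybrid idea, again with invented numbers: instead of dragging one class all the way to the other’s size, both classes meet at a middle target, so less information is lost and fewer duplicates are created.

```python
import random

random.seed(7)  # for reproducibility

# Hypothetical dataset: each item is a (feature, label) pair.
majority = [(i, 0) for i in range(100)]  # 100 majority-class samples
minority = [(i, 1) for i in range(10)]   # 10 minority-class samples

# Meet in the middle: shrink the majority and grow the minority to a shared target.
target = 50
reduced_majority = random.sample(majority, target)
grown_minority = minority + random.choices(minority, k=target - len(minority))
balanced = reduced_majority + grown_minority

print(len(balanced))  # 100 -- 50 samples per class
```

The target of 50 here is arbitrary; in practice it would be tuned to the dataset at hand.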

Thanks for your engagement. Dropping a comment in the comment box won’t cost you a dime, your answers might even be lurking in the curiosity you’re trying to shield from the world. Therefore, pour it all out in the comment section.

This article is one out of a series on Data Imbalance and how to address it; kindly follow me to gain access to subsequent articles, each focused on one of the techniques for resolving Data Imbalance.💕💕💕