What is an Imbalanced Dataset?
A dataset is imbalanced when its samples/instances are unevenly distributed across the target classes, which can make a model look deceptively good, reporting near-perfect accuracy simply by favouring the majority class every time you run it. For example, suppose a simple dataset has 4 features and an output (target) feature with 2 classes, with 100 instances in total. If 80 of those instances belong to category 1 of the target and only 20 belong to category 2, then training and prediction become biased towards category 1. Such a dataset is called an imbalanced dataset.
Let’s get our hands dirty by exploring the Imbalanced dataset and measures to handle the imbalanced classes.
First, as an example, we will take a dataset with 7 features plus a target variable, so in total our dataset contains 8 columns.
Initially, we read the dataset with the “read_csv” method and print the head of the dataset, as below:
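For illustration, something along these lines will do (the file name here is just a placeholder for your actual dataset file):

```python
import pandas as pd

# Load the dataset (the file name is only a placeholder)
df = pd.read_csv("dataset.csv")

# Print the first few rows to get a feel for the data
print(df.head())
```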
Next, we need to find how many categories there are in the target variable “class”:
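For example, assuming the target column is named “class” as described above:

```python
# List the unique categories in the target column
print(df["class"].unique())
# -> ['positive' 'negative']
```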
As you can see, there are two unique categories in the “class” feature. Now we need to find the exact counts of the two categories:
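Something like this (the counts in the comment are the ones for our dataset: 143 positive, 77 negative):

```python
# Count the number of samples per category
print(df["class"].value_counts())
# positive    143
# negative     77
```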
Well, it’s pretty straightforward to see that our target feature has far more “positive” classes than “negative” ones.
Now we can visualize this with a histogram plot; to do that, we first need to convert the “class” column from the “object” type to int:
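A minimal sketch of the conversion and the plot, mapping positive to 1 and negative to 0:

```python
import matplotlib.pyplot as plt

# Map the string labels to integers: positive -> 1, negative -> 0
df["class"] = df["class"].map({"positive": 1, "negative": 0})

# Histogram of the class distribution
df["class"].plot(kind="hist")
plt.xlabel("class")
plt.show()
```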
It’s easy to understand the data when you visualize it like this, isn’t it? Clearly, our dataset has more positive classes (1’s) and fewer negative classes (0’s).
Before training our model, we need to find the most important features in our dataset; this helps to improve the accuracy of our model and lets us discard the useless features that do not contribute to it. To do that, we can use the “RandomForest” classifier itself.
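A sketch using scikit-learn’s RandomForestClassifier; the hyper-parameters (n_estimators, random_state) here are just illustrative choices:

```python
from sklearn.ensemble import RandomForestClassifier

# Separate the features from the target
X = df.drop("class", axis=1)
y = df["class"]

# Fit a RandomForest just to rank the features by importance
# (n_estimators and random_state are arbitrary, illustrative values)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Pair every feature with its importance score, highest first
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```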
As you can see, the features ‘Chg‘ and ‘Lip‘ contribute very little. So we can drop them and keep only the useful features in the dataset.
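Continuing from the X defined above:

```python
# Drop the two least important features found above
X = X.drop(["Chg", "Lip"], axis=1)
print(X.head())
```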
Now, to make things clear, we split our dataset into train and test sets and evaluate the model to see how it produces biased results. Let’s dive in:
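A sketch using scikit-learn’s train_test_split (the random_state is only there for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 80% of the data for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```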
We split our dataset into a train set (80%) to train the model and a test set (20%) to evaluate it, so we train our model on 176 samples and test it on 44 samples.
Now it’s time to train our model using the “RandomForest” classifier:
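A sketch of the training and evaluation step:

```python
from sklearn.metrics import accuracy_score

# Train on the imbalanced training set
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```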
As explained previously, the RandomForest classifier produces an accuracy of 100%, which is biased because there are more positive classes than negative ones (143 positive versus 77 negative), and that imbalance skews the results.
So, to handle this, we have two approaches:
1. Over Sampling
2. Under Sampling
Over Sampling:
It is nothing but re-sampling the minority class until its count is equivalent to that of the majority class.
Ex:
before sampling: Counter({1: 111, 0: 65})
after sampling: Counter({1: 111, 0: 111})
Note: compare the counts of 1’s and 0’s before and after sampling.
Under Sampling:
It is nothing but re-sampling the majority class until its count is equivalent to that of the minority class.
Ex:
before sampling: Counter({1: 111, 0: 65})
after sampling: Counter({0: 65, 1: 65})
There are several algorithms for over sampling and under sampling. The ones we use here are:
Over Sampling Algorithm:
1. SMOTE – “Synthetic Minority Over-sampling Technique”. A subset of data is taken from the minority class as a template, and new, similar synthetic instances are created from it. These synthetic instances are then added to the original dataset, and the new dataset is used to train the classification models.
Under Sampling Algorithms:
1. RandomUnderSampler – Random Undersampling aims to balance the class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.
2. NearMiss – selects the majority class samples whose average distances to the three closest minority class samples are the smallest.
Hopefully these short descriptions give you the overall picture of the sampling algorithms. Let’s implement them in our code and check the accuracy of our model:
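A sketch using the SMOTE implementation from the imbalanced-learn package (recent versions expose fit_resample; very old releases called it fit_sample):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

print("before sampling:", Counter(y_train))

# Over-sample the minority (negative) class on the training data only
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("after sampling:", Counter(y_train_res))
```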
Now the counts of the positive and negative classes are equal, as we over-sampled the negative class to match the count of the positive class using the SMOTE algorithm. Now let’s train our model again on the re-sampled data and evaluate it.
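A sketch of retraining on the re-sampled data and evaluating on the same untouched test set:

```python
# Retrain the classifier on the balanced training data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_res, y_train_res)

# Evaluate on the same (untouched) test set as before
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```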
Well, the accuracy drops to 95%, which is a much more realistic figure compared with the biased accuracy of 100%.
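For comparison, the under-sampling algorithms described earlier plug into exactly the same resample-then-retrain loop; here is a sketch with RandomUnderSampler and NearMiss from imbalanced-learn:

```python
from imblearn.under_sampling import RandomUnderSampler, NearMiss

# Randomly drop majority-class samples until both classes match
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print("RandomUnderSampler:", Counter(y_rus))

# Keep the majority samples whose average distance to their
# three closest minority samples is smallest (NearMiss default)
nm = NearMiss()
X_nm, y_nm = nm.fit_resample(X_train, y_train)
print("NearMiss:", Counter(y_nm))
```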
What are you waiting for? Hurray! We have learnt what imbalanced classes in a dataset are and how to handle them, hands-on, using various sampling algorithms.