I am using scikit-learn's GridSearchCV to find the best parameters for a random forest applied to remote sensing data with 4 classes (buildings, vegetation, water and roads). The problem is that I have far more "vegetation" samples than samples of the other classes (thousands versus several million). Should I balance my testing dataset to obtain the metrics?
I already balance the whole set before I split it into training and testing, which means both datasets end up with the same, equal distribution of classes. I am afraid this does not represent the algorithm's performance on real data, but it gives me an insight into the performance per class. If I use unbalanced data, the "vegetation" class might end up skewing the averaged metrics.
Here is an example of the balancing I do. As you can see, I apply it directly to X and y, which are the full data and labels.
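    from imblearn.under_sampling import RandomUnderSampler

    if balance:
        # Undersample every class down to the size of the smallest one.
        # fit_resample replaces fit_sample, which newer imbalanced-learn releases removed.
        smt = RandomUnderSampler(sampling_strategy='auto')
        X, y = smt.fit_resample(X, y)
        print("Features array shape after balance: " + str(X.shape))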
I want to have the best understanding of the model's performance on the real data, but I have not found conclusive answers for this!
Many studies have shown that for several base classifiers, a balanced data set provides improved overall classification performance compared to an imbalanced data set [27]–[29].
One of the rules in machine learning is that it's important to balance the data set, or at least get it close to balanced. The main reason, in layman's terms, is to give equal priority to each class.
The decision tree algorithm is effective for balanced classification, although it does not perform well on imbalanced datasets. The split points of the tree are chosen to best separate examples into two groups with minimum mixing.
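To make "minimum mixing" concrete, here is a small sketch that computes the Gini impurity of a node from its class counts. Gini is assumed here as the split criterion (it is one common choice; the text above does not say which one is meant), and the helper name `gini` and the sample counts are made up for illustration.

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical two-class nodes: [majority count, minority count].
balanced_node = [500, 500]
imbalanced_node = [990, 10]

print(gini(balanced_node))              # 0.5
print(round(gini(imbalanced_node), 4))  # 0.0198
```

The imbalanced node already looks almost pure before any split, so there is little impurity left to gain by isolating the 10 minority samples. That is one way to see why trees tend to favor the majority class on imbalanced data.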
The rule of thumb for dealing with imbalanced data is "Never ever balance the test data". The pipeline for dealing with imbalanced data:
That way, the test metrics reflect the actual performance.
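The question that arises here is: why not balance the data before the train/test split?
You can't expect real-world data to be balanced when you deploy the model, right? If you balance before splitting, the test set is balanced too, and its metrics no longer describe the distribution the model will actually see.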
A better way is to use K-fold at step 2 and repeat steps 3, 4 and 5 for each fold.
Refer to this article for more info.