Should I balance the test set when I have highly unbalanced data?

I am using sklearn's GridSearchCV to find the best parameters for a random forest applied to remote sensing data with 4 classes (buildings, vegetation, water and roads). The problem is that I have far more "vegetation" samples than anything else (the gap ranges from thousands to several millions). Should I balance my testing dataset to obtain the metrics?

I currently balance the whole set before I split it into training and testing, so both datasets end up with the same, equal distribution of classes. I am afraid this does not represent the algorithm's performance on real data, but it does give me insight into the performance per class. If I use unbalanced data, the "vegetation" class might end up skewing the averages of the other classes.
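As an aside on the averaging concern: one way to keep a dominant class from swamping the summary metrics, regardless of how the data is split, is to report per-class scores plus a macro average, which weights every class equally. A minimal sketch with toy labels (not the asker's actual data):

```python
from sklearn.metrics import classification_report, f1_score

# Toy labels: class 0 ("vegetation"-like majority) dominates, class 1 is rare.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [0] * 5 + [1] * 5  # half the rare class is misclassified

# 'weighted' lets the majority class dominate the average;
# 'macro' gives each class equal weight, exposing the rare-class errors.
print("weighted F1:", f1_score(y_true, y_pred, average='weighted'))
print("macro F1:   ", f1_score(y_true, y_pred, average='macro'))
print(classification_report(y_true, y_pred))
```

Here the weighted F1 looks comfortably high while the macro F1 is noticeably lower, because the rare class's recall of 0.5 counts for half of the macro average.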

Here is how I do the balancing. As you can see, I apply it to X and y directly, which are the full data and labels:

from imblearn.under_sampling import RandomUnderSampler

if balance:
    smt = RandomUnderSampler(sampling_strategy='auto')
    X, y = smt.fit_resample(X, y)  # fit_sample was removed in newer imbalanced-learn
    print("Features array shape after balance: " + str(X.shape))

I want the best possible understanding of the model's performance on real data, but I have not found a conclusive answer to this.

AMNeves asked Apr 30 '19


1 Answer

The rule of thumb when dealing with imbalanced data is: never balance the test data. The pipeline for handling imbalanced data is:

  1. Preprocess the data.
  2. Apply a train/test split (stratified).
  3. Balance the training data only (SMOTE generally works well).
  4. Train the model(s).
  5. Test on the imbalanced test data (and use metrics like F-score, precision and recall).

That way you measure the model's actual performance.

Why not balance the data before the train/test split? Because you can't expect the real-world data to be balanced when you deploy the model, right?

A better way is to use K-fold at step 2 and repeat steps 3-5 for each fold.

Refer to this article for more info.

Veera Srikanth answered Oct 11 '22