Should I balance the test set when I have highly unbalanced data?

I am using scikit-learn's GridSearchCV to find the best parameters for a random forest applied to remote sensing data with 4 classes (buildings, vegetation, water and roads). The problem is that I have far more "vegetation" samples than samples of the other classes (the counts range from thousands to several millions). Should I balance my testing dataset before computing the metrics?

I currently balance the whole dataset before I split it into training and testing sets, so both sets contain the same number of samples per class. I am afraid this does not represent the algorithm's performance on real data, but it gives me an insight into the performance per class. If I use unbalanced data, the "vegetation" class might end up dominating the averaged metrics.

Here is an example of the balancing I do. As you can see, I apply it directly to X and y, which are the full data and labels.

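from imblearn.under_sampling import RandomUnderSampler

if balance:
    # 'auto' undersamples every class except the smallest down to its size.
    sampler = RandomUnderSampler(sampling_strategy='auto')
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases removed.
    X, y = sampler.fit_resample(X, y)
    print("Features array shape after balance: " + str(X.shape))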

I want to have the best understanding of the model's performance on the real data, but I have not found conclusive answers for this!

asked Apr 30 '19 12:04

AMNeves

1 Answer

The rule of thumb for dealing with imbalanced data is "never balance the test data". The pipeline for dealing with imbalanced data is:

1. Preprocess the data.
2. Apply a stratified train/test split.
3. Balance the training data only (SMOTE is a common choice and often works well).
4. Train the model or models.
5. Test on the imbalanced test data, using metrics such as precision, recall and F-score rather than plain accuracy.

That way you measure the performance you can actually expect on real data.
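The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the data is a synthetic stand-in generated with `make_classification` (the class proportions are made up), and `undersample` is a hypothetical NumPy helper that does plain random undersampling in place of SMOTE, so that the example needs nothing beyond NumPy and scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


def undersample(X, y, seed=0):
    """Hypothetical helper: randomly shrink every class to the minority class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


# Synthetic stand-in for the 4-class data (assumed proportions, not real data).
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=6, n_classes=4,
    weights=[0.85, 0.05, 0.05, 0.05], random_state=0,
)

# Step 2: stratified split, so the test set keeps the original class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 3: balance the training portion only.
X_bal, y_bal = undersample(X_train, y_train)

# Step 4: train on the balanced training data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)

# Step 5: evaluate on the untouched, imbalanced test set with per-class metrics.
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```

The per-class rows of the report give the per-class insight asked about in the question, while the test set itself still has the real class ratios.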

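This raises the question: why not balance the data before the train/test split?

Because you can't expect real-world data to be balanced once the model is deployed, so a balanced test set gives metrics that do not match what you will see in practice. With oversampling there is a second problem: copies or synthetic neighbours of training samples can end up in the test set and inflate the scores.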

A better approach is to use K-fold cross-validation at step 2 and repeat steps 3, 4 and 5 for each fold.
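A per-fold version might look like the sketch below. It rests on the same assumptions as before: synthetic data with made-up proportions, and a hypothetical `undersample` helper standing in for whatever balancing method you prefer. The point it illustrates is that the balancing happens inside the loop, on the training indices only, so each held-out fold keeps its natural class ratios.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def undersample(X, y, seed=0):
    """Hypothetical helper: randomly shrink every class to the minority class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


# Synthetic stand-in for the imbalanced 4-class data (assumed proportions).
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=6, n_classes=4,
    weights=[0.85, 0.05, 0.05, 0.05], random_state=0,
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Balance only the training part of this fold.
    X_bal, y_bal = undersample(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_bal, y_bal)
    # Score on the held-out fold, which is left imbalanced.
    y_pred = clf.predict(X[test_idx])
    fold_scores.append(f1_score(y[test_idx], y_pred, average="macro"))

print("macro F1 per fold:", np.round(fold_scores, 3))
print("mean macro F1:", float(np.mean(fold_scores)))
```

If you use imbalanced-learn, its own `Pipeline` class (from `imblearn.pipeline`) can wrap the sampler and the classifier so that GridSearchCV resamples only the data it fits on in each fold, which should achieve the same effect without a manual loop.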

Refer to this article for more info.

answered Oct 11 '22 05:10

Veera Srikanth

Tags: python, machine-learning, scikit-learn, random-forest
