Neural Network - Working with an imbalanced dataset

I am working on a classification problem with two labels: 0 and 1. My training dataset is very imbalanced (and so will be the test set, given my problem).

The class ratio is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million. So I should get around 100,000 samples for label '1'.

Given the large number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether a neural network could handle this kind of imbalance effectively on such a large dataset?

Also, as I am using TensorFlow to design the model, which characteristics should or could I tune to handle this imbalance?

Thanks for your help! Paul

Update:

Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.

1) Over the weekend I tried the first option, increasing the cost for the positive label. With a less imbalanced ratio (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to shift the balance between the precision and recall scores. For my situation, however, it seems to be very sensitive to the alpha value. With alpha = 250, which is the imbalance ratio of the dataset, I get a precision of 0.006 and a recall of 0.83, but the model predicts far more 1s than it should: around 0.50 of its predictions are label '1'. With alpha = 100, the model predicts only '0'. I guess I'll have to do some tuning of this alpha parameter :/ I'll also take a look at this TF function, as I did the weighting manually for now: tf.nn.weighted_cross_entropy_with_logits

2) I will try to rebalance the dataset, but I am afraid that I will lose a lot of information doing that, as I have millions of samples but only ~100k positive samples.

3) Using a smaller batch size does indeed seem like a good idea. I'll try it!
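
For reference, here is a minimal sketch of the weighting described in point 1. It assumes that alpha plays the same role as the pos_weight argument of tf.nn.weighted_cross_entropy_with_logits, whose documented per-example loss is pos_weight * labels * -log(sigmoid(logits)) + (1 - labels) * -log(1 - sigmoid(logits)). The code is plain Python rather than TensorFlow so that it runs on its own; the function names and the toy data are hypothetical and are not taken from the question.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_cross_entropy(label, logit, pos_weight):
    """Per-example sigmoid cross-entropy with a weight on the positive class.

    Uses -log(sigmoid(z)) == softplus(-z) and -log(1 - sigmoid(z)) == softplus(z).
    Only the positive term is scaled by pos_weight (the "alpha" of the question).
    """
    positive_term = pos_weight * label * softplus(-logit)
    negative_term = (1.0 - label) * softplus(logit)
    return positive_term + negative_term


def mean_loss(labels, logits, pos_weight):
    """Average the weighted loss over a batch."""
    losses = [weighted_cross_entropy(y, z, pos_weight)
              for y, z in zip(labels, logits)]
    return sum(losses) / len(losses)


# Hypothetical toy batch: one positive, three negatives, all scored at logit 0.
labels = [1.0, 0.0, 0.0, 0.0]
logits = [0.0, 0.0, 0.0, 0.0]

unweighted = mean_loss(labels, logits, pos_weight=1.0)
weighted = mean_loss(labels, logits, pos_weight=250.0)
print("unweighted:", unweighted, "weighted:", weighted)
```

With pos_weight = 250, a single missed positive contributes as much loss as 250 equally confident mistakes on negatives, which may be part of why the results are so sensitive to alpha. Intermediate values are probably worth trying, ideally judged on a validation set that keeps the original class ratio.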

asked Jul 29 '16 17:07

Paul Rolin

1 Answer

There are usually two common ways to handle an imbalanced dataset:

1) Online sampling, as mentioned above. In each iteration you sample a class-balanced batch from the training set.
2) Re-weight the cost of the two classes. You'd want to give the loss on the dominant class a smaller weight. For example, this is used in the paper Holistically-Nested Edge Detection.
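
The online sampling option can be sketched as follows. This is only an illustration in plain Python, not the answerer's code: the helper balanced_batches, its parameters, and the toy dataset are all hypothetical. It assumes positives are drawn with replacement (they are scarce, so they will repeat across batches) and negatives without replacement within a batch.

```python
import random


def balanced_batches(pos_indices, neg_indices, batch_size, num_batches, seed=0):
    """Yield batches of dataset indices that are half positive, half negative."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even")
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        pos = rng.choices(pos_indices, k=half)  # with replacement
        neg = rng.sample(neg_indices, half)     # without replacement
        batch = pos + neg
        rng.shuffle(batch)                      # mix the two classes
        yield batch


# Hypothetical toy dataset with the 250:1 ratio from the question:
# every 251st sample is positive, giving 100 positives and 25,000 negatives.
labels = [1 if i % 251 == 0 else 0 for i in range(25100)]
pos_idx = [i for i, y in enumerate(labels) if y == 1]
neg_idx = [i for i, y in enumerate(labels) if y == 0]

batches = list(balanced_batches(pos_idx, neg_idx, batch_size=32, num_batches=5))
print([sum(labels[i] for i in b) for b in batches])
```

No sample is thrown away with this approach: every negative can still be drawn in some iteration, while each batch stays balanced. A model trained this way sees positives far more often than it will at test time, so its predicted probabilities are likely to be shifted toward label '1'; the decision threshold is probably best chosen on data with the original class ratio.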

answered Oct 15 '22 09:10

ppwwyyxx

Tags: neural-network, tensorflow, random-forest
