I have two data sets that look like this:
DATASET 1
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)
DATASET 2
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)
I am trying to build a deep feedforward neural net in TensorFlow. I get accuracies in the 90s and AUC scores in the 80s. Of course, the data set is heavily imbalanced, so those metrics are useless. My emphasis is on getting a good recall value, and I do not want to oversample Class 1. I have toyed with the complexity of the model to no avail; the best model predicted only 25% of the positive class correctly.
My question is: considering the distribution of these data sets, is it futile to build models without getting more data (I can't get more data), or is there a way to work with data that is this imbalanced?
Thanks!
Deep Learning for Imbalanced Classification
Given the balanced focus on misclassification errors, most standard neural network algorithms are not well suited to datasets with a severely skewed class distribution. Most existing deep learning algorithms do not take the data imbalance problem into consideration.
An unbalanced dataset is a common issue in all areas and is not specific to computer vision or to problems dealt with by Convolutional Neural Networks (CNNs). To tackle this problem, you should try to balance your dataset, either by over-sampling the minority classes or by under-sampling the majority classes (or both).
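If you do decide to resample, here is a minimal sketch of minority over-sampling with NumPy. The arrays are toy stand-ins for your data (the shapes and class counts just mimic your training set), not a definitive recipe:

import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for the real training data (hypothetical shapes).
X_train = rng.normal(size=(9362, 20))
y_train = (rng.random(9362) < 380 / 9362).astype(int)

pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]

# Over-sample: draw minority rows with replacement until the classes match.
resampled_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
idx = rng.permutation(np.concatenate([neg_idx, resampled_pos]))
X_balanced, y_balanced = X_train[idx], y_train[idx]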
Can I use TensorFlow to learn imbalanced classification with a ratio of about 30:1?
Yes, and I have. Specifically, TensorFlow provides the ability to feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy; it has a weights parameter. You can feed in a matrix that matches Y in shape, providing for each value of Y the relative weight that training example should have.
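As a minimal sketch of how that weights parameter might be wired up (TF 1.x API; the 24:1 weight is just the inverse class frequency from your training set, 8982/380 ≈ 24, an illustrative starting point rather than a tuned value):

import tensorflow as tf  # TF 1.x, matching the tf.losses API referenced above

# Hypothetical graph inputs: 0/1 labels and raw logits from the network.
labels = tf.placeholder(tf.float32, shape=[None, 1])
logits = tf.placeholder(tf.float32, shape=[None, 1])

# Up-weight positives by roughly the inverse class frequency (assumption).
weights = tf.where(tf.equal(labels, 1.0),
                   24.0 * tf.ones_like(labels),
                   tf.ones_like(labels))

loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels,
                                       logits=logits,
                                       weights=weights)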
One way to find the correct weights is to start with different balances, run your training, and then look at your confusion matrix and a rundown of precision versus accuracy for each class. Once both classes have about the same precision-to-accuracy ratio, they are balanced.
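One way to inspect that per-class breakdown after each run is something like the following (this uses scikit-learn, which is an assumption on my part, with toy prediction arrays standing in for your model's output):

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Toy predictions from one hypothetical training run.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 0, 1, 0, 1])

print(confusion_matrix(y_true, y_pred))
# Per-class precision/recall to compare across re-weighting runs.
print(classification_report(y_true, y_pred, digits=3))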
Here is an example implementation that converts a Y into a weight matrix and has performed very well for me:
import numpy as np

def weightMatrix(matrix, most=0.9):
    # Per-column positive rate, clipped to the range [1 - most, most].
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    # Positives get weight a; negatives get a * b / (1 - b).
    weights = a * (matrix + (1 - matrix) * b / (1 - b))
    return weights
The most parameter represents the largest fractional difference to consider: 0.9 equates to .1:.9 = 1:9, whereas .5 equates to 1:1. Values below .5 don't work.
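For context, a hypothetical usage of the weightMatrix function (and NumPy import) above, with toy labels:

# Toy 0/1 label matrix shaped like Y (one column per class).
Y = np.array([[0.], [0.], [0.], [1.]], dtype=np.float32)

W = weightMatrix(Y)  # same shape as Y; the rare class gets the larger weight
print(W)
# W can then be passed as the `weights` argument of
# tf.losses.sigmoid_cross_entropy alongside the logits.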
You might be interested in having a look at this question and its answer. Its scope is a priori more restricted than yours, as it specifically addresses weights for classification, but it seems very relevant to your case.
Also, AUC is definitely not irrelevant: it is actually independent of your data imbalance.