Deep learning on an imbalanced data set

I have two data sets that look like this:

DATASET 1
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)

DATASET 2
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)

I am trying to build a deep feedforward neural net in Tensorflow. I get accuracies in the 90s and AUC scores in the 80s. Of course, the data set is heavily imbalanced, so those metrics are useless. My emphasis is on getting a good recall value and I do not want to oversample Class 1. I have toyed with the complexity of the model to no avail; the best model predicted only 25% of the positive class correctly.

My question is: considering the distribution of these data sets, is it futile to build models without getting more data (I can't get more data), or is there a way to work with data that is this imbalanced?

Thanks!

Anderlecht asked Jun 16 '17 19:06


2 Answers

Question

Can I use Tensorflow to learn imbalanced classification with a ratio of about 30:1?

Answer

Yes, and I have. Specifically, Tensorflow provides the ability to feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy; it has a weights parameter. You can feed in a matrix that matches Y in shape and, for each value of Y, provides the relative weight that training example should have.
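To see what that weights parameter does, here is a minimal numpy sketch of per-example weighted sigmoid cross-entropy, the quantity tf.losses.sigmoid_cross_entropy computes when given weights (the function and variable names below are illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def weighted_sigmoid_xent(y, logits, weights):
    # Per-example sigmoid cross-entropy, with each example's loss term
    # multiplied by its weight before averaging.
    p = sigmoid(logits)
    per_example = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.mean(weights * per_example)

y = np.array([1., 0., 0., 0.])   # one positive among three negatives
logits = np.zeros(4)             # untrained net: p = 0.5 everywhere
flat = weighted_sigmoid_xent(y, logits, np.ones(4))
upweighted = weighted_sigmoid_xent(y, logits, np.array([3., 1., 1., 1.]))
```

With uniform weights every example contributes log(2) to the mean; tripling the minority example's weight makes its misclassification cost the optimizer three times as much, which is the whole mechanism behind loss weighting.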

One way to find the correct weights is to start with different balances, run your training, and then look at your confusion matrix and a rundown of precision vs. accuracy for each class. Once both classes have about the same precision-to-accuracy ratio, they are balanced.

Example Implementation

Here is an example implementation that converts a Y into a weight matrix; it has performed very well for me:

import numpy as np

def weightMatrix(matrix, most=0.9):
    # Clip the positive-class frequency to the range [1 - most, most]
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    # Scale factor so that each class contributes half of the total weight
    a = 1. / (b * 2.)
    # Positives get weight a; negatives get a * b / (1 - b)
    weights = a * (matrix + (1 - matrix) * b / (1 - b))
    return weights

The most parameter represents the largest fractional difference to consider. 0.9 equates to .1:.9 = 1:9, whereas .5 equates to 1:1. Values below .5 don't work.
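A quick check of what the function produces (the definition is repeated here so the snippet runs standalone; the example labels are made up):

```python
import numpy as np

def weightMatrix(matrix, most=0.9):
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    return a * (matrix + (1 - matrix) * b / (1 - b))

# A 10-example batch with one positive label (10% positive, i.e. 1:9)
Y = np.array([[1.], [0.], [0.], [0.], [0.],
              [0.], [0.], [0.], [0.], [0.]])
w = weightMatrix(Y)
# The single positive gets weight 5.0 and each negative gets 5/9,
# so the two classes contribute equally in total: 1 * 5.0 == 9 * 5/9.
```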

Anton Codes answered Oct 16 '22 15:10


You might be interested to have a look at this question and its answer. Its scope is a priori more restricted than yours, as it specifically addresses weights for classification, but it seems very relevant to your case.

Also, AUC is definitely not irrelevant: it is actually independent of your data imbalance.
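The imbalance-independence of AUC can be checked directly: in its Mann-Whitney form, AUC is the probability that a random positive scores above a random negative, so replicating the negatives (making the data more imbalanced) leaves it unchanged. A small illustration with made-up scores:

```python
import numpy as np

def auc(pos_scores, neg_scores):
    # Mann-Whitney form of ROC AUC: P(positive score > negative score),
    # counting ties as one half, over all positive/negative pairs.
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return np.mean((pos > neg) + 0.5 * (pos == neg))

pos = [0.9, 0.8]
neg = [0.3, 0.85]
a_balanced = auc(pos, neg)        # 2 positives vs 2 negatives
a_skewed = auc(pos, neg * 10)     # same negatives repeated: 2 vs 20
```

Both calls return the same value, because duplicating negatives scales the numerator and denominator of the pairwise count equally.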

P-Gn answered Oct 16 '22 15:10