 

Classification: skewed data within a class

I'm trying to build a multilabel classifier that predicts, for each label, the probability of the input being 0 or 1. I'm using a neural network with TensorFlow + Keras (maybe a CNN later).

The problem is the following: the data is highly skewed. There are a lot more negative examples than positive ones, maybe 90:10. So my neural network nearly always outputs very low probabilities for positive examples; thresholded to binary values, it would predict 0 in most cases.

The accuracy is > 95% for nearly all classes, but that is only because the network almost always predicts zero... so the number of false negatives is very high.

Any suggestions on how to fix this?

Here are the ideas I considered so far:

  1. Penalizing false negatives more heavily with a customized loss function (my first attempt at this failed), i.e. weighting positive examples within a class more than negative ones, analogous to class weights but applied inside a class. How would you implement this in Keras? (See the sketch after this list.)

  2. Oversampling positive examples by duplicating them, then fitting the neural network on the resulting set so that positive and negative examples are balanced.
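For idea 1, here is a minimal sketch of what a false-negative-penalizing loss could look like in Keras. The weight of 9.0 (picked to match the ~90:10 ratio) and names like `weighted_bce` are assumptions for illustration, not an established recipe:

```python
import tensorflow as tf

def weighted_bce(pos_weight=9.0):
    """Binary cross-entropy that counts positive labels pos_weight
    times as much as negative ones (9.0 is an assumption matching
    a ~90:10 negative:positive ratio)."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        # Element-wise BCE per label, then up-weight entries where
        # the true label is 1 so missed positives cost more.
        bce = tf.keras.backend.binary_crossentropy(y_true, y_pred)
        weights = y_true * (pos_weight - 1.0) + 1.0
        return tf.reduce_mean(weights * bce, axis=-1)
    return loss

# Hypothetical model: 100 input features, 5 independent binary labels.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(100,)),
    tf.keras.layers.Dense(5, activation="sigmoid"),
])
model.compile(optimizer="adam", loss=weighted_bce(9.0))
```

For a single binary output, passing `class_weight={0: 1.0, 1: 9.0}` to `model.fit` achieves a similar effect without a custom loss; for multilabel outputs, a weighted loss like the sketch above is the more direct route.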

Thanks in advance!

asked Feb 20 '18 by BugridWisli



1 Answer

You're on the right track.

Usually, you would balance your data set before training: either reduce the over-represented class or generate artificial (augmented) data for the under-represented class to boost its occurrence.

  1. Reduce the over-represented class. This one is simpler: you would just randomly pick as many samples as there are in the under-represented class, discard the rest, and train with the new subset (see the sketch right after this list). The disadvantage, of course, is that you're losing some learning potential, depending on how complex your task is (how many features it has).

  2. Augment data. Depending on the kind of data you're working with, you can "augment" it. That just means you take existing samples from your data, slightly modify them, and use them as additional samples. This works very well with image and sound data. You could flip/rotate, scale, crop, add noise, or in-/decrease brightness, etc. The important thing here is that you stay within the bounds of what could happen in the real world. If, for example, you want to recognize a "70 mph speed limit" sign, flipping it doesn't make sense: you will never encounter an actual flipped 70 mph sign. If you want to recognize a flower, flipping or rotating it is permissible. The same goes for sound: changing volume/frequency slightly won't matter much, but reversing the audio track changes its "meaning", and you won't have to recognize backwards-spoken words in the real world. (An image-augmentation sketch follows the next paragraph.)
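A minimal sketch of option 1, assuming the data lives in NumPy arrays `X` and `y` with a single binary label per sample (all names here are hypothetical; with multilabel data the bookkeeping gets trickier):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop negatives until both classes are the same size."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Keep every positive, plus an equally large random subset of
    # negatives, sampled without replacement.
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = rng.permutation(np.concatenate([pos_idx, keep_neg]))
    return X[idx], y[idx]
```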

Now if you have to augment tabular data (sales data, metadata, etc.), that's much trickier, as you have to be careful not to implicitly feed your own assumptions into the model.
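For the image case, a minimal sketch of such label-preserving augmentation with `tf.image`; the specific operations and parameter values are assumptions, chosen to mirror the flip/brightness tweaks described above:

```python
import tensorflow as tf

def augment(image, label):
    """Random, label-preserving tweaks applied on the fly."""
    image = tf.image.random_flip_left_right(image)  # only if a flip is plausible!
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, 0.8, 1.2)
    return image, label

# Hypothetical usage with placeholder data: augment only the
# training pipeline, never the validation/test sets.
images = tf.random.uniform((100, 32, 32, 3))
labels = tf.cast(tf.random.uniform((100,)) < 0.1, tf.int32)
train_ds = (tf.data.Dataset.from_tensor_slices((images, labels))
            .map(augment)
            .shuffle(100)
            .batch(32))
```

Because the random ops run inside `map`, each epoch sees slightly different versions of the same under-represented samples, which is what boosts their effective count.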

answered Oct 01 '22 by Norms