 

Imbalanced Dataset Using Keras

I am building a classification ANN with Python and the Keras library. I am training the NN on an imbalanced dataset with 3 different classes. Class 1 is about 7.5 times as prevalent as Classes 2 and 3. As a remedy, I took the advice of this Stack Overflow answer and set my class weights as such:

class_weight = {0 : 1,
                1 : 6.5,
                2: 7.5}
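
(For reference, weights like these can also be derived from the class frequencies instead of being set by hand; a minimal sketch using scikit-learn's `compute_class_weight`, with a hypothetical label vector standing in for the real `y_train`:)

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: class 0 is 7.5x as prevalent as classes 1 and 2
y_train = np.array([0] * 75 + [1] * 10 + [2] * 10)

# 'balanced' weights each class by n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1, 2]),
                               y=y_train)
class_weight = dict(enumerate(weights))
print(class_weight)  # class 0 gets the smallest weight
```

With this toy distribution, the minority-class weights come out exactly 7.5 times the majority-class weight, matching the hand-set ratio above.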

However, here is the problem: The ANN is predicting the 3 classes at equal rates!

This is not useful because the dataset is imbalanced, and predicting the outcomes as each having a 33% chance is inaccurate.

Here is the question: How do I deal with an imbalanced dataset so that the ANN does not predict Class 1 every time, but also so that the ANN does not predict the classes with equal probability?

Here is my code I am working with:

class_weight = {0 : 1,
                1 : 6.5,
                2 : 7.5}

# Making the ANN
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout

classifier = Sequential()


# Adding the input layer and the first hidden layer with dropout
classifier.add(Dense(activation = 'relu',
                     input_dim = 5,
                     units = 3,
                     kernel_initializer = 'uniform'))
#Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate= 0.1))

#Adding the second hidden layer
classifier.add(Dense(activation = 'relu',
                     units = 3,
                     kernel_initializer = 'uniform'))
#Randomly drops 0.1, 10% of the neurons in the layer.
classifier.add(Dropout(rate = 0.1)) 

# Adding the output layer
classifier.add(Dense(activation = 'sigmoid',
                     units = 2,
                     kernel_initializer = 'uniform'))

# Compiling the ANN
classifier.compile(optimizer = 'adam',
                   loss = 'binary_crossentropy',
                   metrics = ['accuracy'])

# Fitting the ANN to the training set
classifier.fit(X_train, y_train, batch_size = 100, epochs = 100, class_weight = class_weight)
Asked by hyCook on Jan 31 '18.

1 Answer

The most evident problem that I see with your model is that it is not properly structured for classification. If your samples can belong to only one class at a time, then you should not overlook this fact by having a sigmoid activation as your last layer.

Ideally, the last layer of a classifier should output the probability of a sample belonging to each class, i.e. (in your case) an array [a, b, c] where a + b + c == 1.

If you use a sigmoid output, then the output [1, 1, 1] is possible, although it is not what you are after. This is also the reason why your model is not generalizing properly: since you're not specifically training it to prefer "unbalanced" outputs (like [1, 0, 0]), it will default to predicting the average values that it sees during training, accounting for the reweighting.
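
(The difference is easy to see numerically; a quick NumPy sketch, independent of the model above:)

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])

# Sigmoid treats each output independently: values need not sum to 1,
# so all three classes can look "likely" at once
sigmoid = 1 / (1 + np.exp(-logits))

# Softmax normalizes across classes: values always sum to 1,
# so probability mass must be shared between classes
softmax = np.exp(logits) / np.exp(logits).sum()

print(sigmoid.sum())  # well above 1
print(softmax.sum())  # 1 (up to float rounding)
```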

Try changing the activation of your last layer to 'softmax' and the loss to 'categorical_crossentropy':

# Adding the output layer: one unit per class
classifier.add(Dense(activation='softmax',
                     units=3,
                     kernel_initializer='uniform'))

# Compiling the ANN
classifier.compile(optimizer='adam',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
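
Note that 'categorical_crossentropy' expects one-hot encoded targets. If y_train holds integer labels (0, 1, 2), it needs converting first; Keras provides to_categorical for this, and a plain-NumPy equivalent (a sketch, with a hypothetical label array) looks like:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer class labels to one-hot rows."""
    return np.eye(num_classes)[labels]

y_int = np.array([0, 2, 1, 0])   # hypothetical integer labels
y_onehot = one_hot(y_int, num_classes=3)
print(y_onehot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```

Alternatively, keeping the integer labels and using loss='sparse_categorical_crossentropy' avoids the conversion entirely.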

If this doesn't work, see my other comment and get back to me with that info, but I'm pretty confident that this is the main problem.
Cheers

Answered by Daniele Grattarola on Oct 15 '22.