sklearn utils compute_class_weight function for large dataset

I am training a TensorFlow Keras sequential model on 20+ GB of text-based categorical data stored in a Postgres database, and I need to pass class weights to the model. Here is what I am doing:

class_weights = sklearn.utils.class_weight.compute_class_weight('balanced', classes, y)

model.fit(x, y, epochs=100, batch_size=32, class_weight=class_weights, validation_split=0.2, callbacks=[early_stopping])

Since I can't load the whole dataset into memory, I figured I could use the fit_generator method of the Keras model.

However, how can I calculate the class weights for this data? sklearn does not seem to provide a function for data that does not fit in memory. Is it the right tool for this?

I thought of computing the weights on multiple random samples, but is there a better approach that uses the whole dataset?

Vibhor asked Feb 26 '20 07:02


1 Answer

You can train with a generator and still compute the class weights.

Let's say you have a generator like this:

train_generator = train_datagen.flow_from_directory(
        'train_directory',
        target_size=(224, 224),
        batch_size=32,
        class_mode = "categorical"
        )

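The class weights for the training set can then be computed like this:

import numpy as np
from sklearn.utils import class_weight

# Keyword arguments are used because recent scikit-learn versions
# no longer accept these parameters positionally.
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)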
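One caveat when passing the result to Keras: compute_class_weight returns an array ordered like its classes argument, while the class_weight argument of model.fit expects a dictionary mapping integer class indices to weights. A minimal sketch of the conversion, assuming the labels are already the integer indices 0..n-1 (as train_generator.classes are); the helper name is hypothetical and the input values are stand-ins:

```python
def weights_array_to_dict(weights):
    """Map position i in the weights array to class index i."""
    return {index: float(weight) for index, weight in enumerate(weights)}


# Stand-in values for what compute_class_weight might return for 3 classes.
example = weights_array_to_dict([1.5, 0.75, 1.0])
print(example)
```

The resulting dictionary is what would be passed as class_weight when training.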

[EDIT 1] Since you mentioned Postgres in the comments, I am adding a prototype answer here.

First, fetch the count for each class with a separate query against Postgres, then compute the class weights manually from those counts. The basic logic: the class with the fewest samples gets a weight of 1, and every other class gets a weight below 1, equal to the smallest count divided by its own count.

For example, with 3 classes A, B, C and counts 100, 200, 150, the class weights become {A: 1, B: 0.5, C: 0.67}.
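The arithmetic above can be checked with a short sketch. For comparison, it also applies the formula that sklearn's 'balanced' mode uses, n_samples / (n_classes * count), to the same per-class counts, so the full dataset never has to be loaded. Both function names are hypothetical.

```python
def relative_to_rarest(counts):
    """Weight 1 for the rarest class, smaller weights for more frequent ones."""
    least = min(counts.values())
    return {label: least / count for label, count in counts.items()}


def balanced(counts):
    """sklearn-style 'balanced' weights computed from per-class counts."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


counts = {"A": 100, "B": 200, "C": 150}
print(relative_to_rarest(counts))  # A -> 1.0, B -> 0.5, C -> about 0.667
print(balanced(counts))            # {'A': 1.5, 'B': 0.75, 'C': 1.0}
```

The two schemes differ only by a constant factor (1.5 in this example), so they weight the classes identically relative to each other.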

Let's compute it manually after fetching the values from Postgres.

[Query]

cur.execute("SELECT class, count(*) FROM table GROUP BY class ORDER BY 2")
rows = cur.fetchall()

The above query returns rows as tuples (class name, count for that class), ordered from the lowest count to the highest. Here table and class stand in for your own table name and label column.

The code below then builds the class weights dictionary:

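class_weights = {}
# Smallest class count; taking min() keeps this correct even if the rows are not sorted.
least_count = min(count for _, count in rows)
for class_name, count in rows:
    # Divide the smallest count by the current count,
    # so that the rarest class gets 1
    # and every other class gets a value < 1.
    class_weights[class_name] = least_count / count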
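Keras expects the class_weight dictionary to be keyed by integer class index rather than by class name, so the dictionary above may need re-keying. A minimal sketch, assuming a hypothetical label_to_index mapping that matches however the labels were encoded for training:

```python
def to_keras_class_weight(weights_by_label, label_to_index):
    """Re-key a {class name: weight} dict by integer class index."""
    return {label_to_index[label]: weight
            for label, weight in weights_by_label.items()}


# Illustrative values only; in practice these come from the query result
# and from the label encoding used for the training targets.
weights_by_label = {"A": 1.0, "B": 0.5, "C": 100 / 150}
label_to_index = {"A": 0, "B": 1, "C": 2}  # assumed encoding

keras_class_weight = to_keras_class_weight(weights_by_label, label_to_index)
print(keras_class_weight)
```

The result can then be passed as the class_weight argument when training.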
venkata krishnan answered Oct 19 '22 10:10