I am training a TensorFlow Keras sequential model on roughly 20+ GB of text-based categorical data stored in a Postgres database, and I need to pass class weights to the model. Here is what I am doing:
class_weights = sklearn.utils.class_weight.compute_class_weight('balanced', classes=classes, y=y)
model.fit(x, y, epochs=100, batch_size=32, class_weight=class_weights, validation_split=0.2, callbacks=[early_stopping])
Since I can't load the whole dataset into memory, I figured I could use Keras's fit_generator method.
However, how can I calculate the class weights over this data? sklearn does not provide a special function for this; is it even the right tool for the job?
I thought of computing the weights on multiple random samples, but is there a better approach that uses the whole dataset?
If 'balanced' is passed, class weights are computed as n_samples / (n_classes * np.bincount(y)). If a dictionary is given instead, its keys are the classes and its values are the corresponding class weights.
Note that compute_class_weight returns an array of weights (ordered to match the classes argument), while Keras expects a dictionary of the form { class_label: class_weight }. For a simple model with a single output, TensorFlow offers a class_weight parameter in model.fit() that lets you directly specify the weight for each target class.
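To make the 'balanced' formula concrete, here is a minimal sketch that applies n_samples / (n_classes * np.bincount(y)) to a small hypothetical label array and builds the { class_label: class_weight } dictionary Keras expects:

```python
import numpy as np

# Hypothetical integer labels: class 0 appears 4x, class 1 2x, class 2 6x
y = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2])

n_samples = len(y)
n_classes = len(np.unique(y))

# The 'balanced' heuristic: rarer classes get proportionally larger weights
weights = n_samples / (n_classes * np.bincount(y))

# Convert the array to the {class_label: class_weight} dict Keras expects
class_weight_dict = dict(enumerate(weights))
print(class_weight_dict)  # class 1 (rarest) gets the largest weight
```

The rarest class (label 1, with 2 samples) receives weight 2.0, twice that of class 0, so its gradient contributions are scaled up accordingly.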
You can use a generator and still compute the class weights.
Let's say you have your generator like this:
train_generator = train_datagen.flow_from_directory(
    'train_directory',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical'
)
Then the class weights for the training set can be computed like this (recent versions of scikit-learn require classes and y to be passed as keyword arguments):
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes
)
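One caveat worth spelling out: compute_class_weight returns a NumPy array, but model.fit's class_weight parameter wants a dictionary. A minimal sketch of the conversion, using a small hypothetical label array in place of train_generator.classes:

```python
import numpy as np
from sklearn.utils import class_weight

# Hypothetical stand-in for train_generator.classes
labels = np.array([0, 0, 0, 1, 1, 2])

# Returns an array of weights, one per class, in the order of np.unique(labels)
weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(labels),
    y=labels,
)

# Keras expects a {class_index: weight} dict, not an array
class_weights = dict(enumerate(weights))
# model.fit(..., class_weight=class_weights)
```

With counts of 3, 2, and 1 for classes 0, 1, and 2, the rarest class ends up with the largest weight (2.0).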
[EDIT 1] Since you mentioned PostgreSQL in the comments, I am adding a prototype answer here.
First, fetch the count for each class with a separate query from PostgreSQL and use it to compute the class weights; you can compute them manually. The basic logic is that the least-frequent class gets weight 1, and every other class gets a weight < 1 based on its count relative to the least-frequent class.
For example, if you have 3 classes A, B, C with counts 100, 200, 150, the class weights become {A: 1, B: 0.5, C: 0.66}.
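That scheme can be sketched in a few lines. The class names and counts below are the ones from the example; with real data they would come from the database query:

```python
# Least-frequent class gets weight 1; every other class gets
# (least count / own count), which is < 1 for more frequent classes.
counts = {'A': 100, 'B': 200, 'C': 150}

least = min(counts.values())  # 100, the count of class A
class_weights = {cls: least / cnt for cls, cnt in counts.items()}
print(class_weights)  # A: 1.0, B: 0.5, C: ~0.667
```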
Let's compute it manually after fetching the values from PostgreSQL.
[Query]
cur.execute("SELECT class, count(*) FROM table GROUP BY class ORDER BY 2")
rows = cur.fetchall()
The query above returns rows of tuples (class name, count for that class), ordered from the least frequent class to the most frequent.
Then the code below builds the class-weights dictionary:
class_weights = {}
for row in rows:
    # Divide the smallest count by the current count, so the
    # least-frequent class gets weight 1 and all other classes
    # get weights < 1.
    class_weights[row[0]] = rows[0][1] / row[1]