
BERT HuggingFace gives NaN Loss

I'm trying to fine-tune BERT for a text classification task, but I'm getting NaN losses and can't figure out why.

First, I define a BERT tokenizer and then tokenize my text:

import numpy as np
import pandas as pd
from tqdm import tqdm
from transformers import DistilBertTokenizer

distil_bert = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)

def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [],[],[]
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=25, pad_to_max_length=True, 
                                             return_attention_mask=True, return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])        

    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')

train = pd.read_csv('train_dataset.csv')
d = train['text']
input_ids, input_masks, input_segments = tokenize(d, tokenizer)

Next, I load my integer labels, which are 0, 1, 2, and 3:

d_y = train['label']
0    0
1    1
2    0
3    2
4    0
5    0
6    0
7    0
8    3
9    1
Name: label, dtype: int64

Then I load the pretrained Transformer model and put layers on top of it. I use sparse categorical cross-entropy as the loss when compiling the model:

import tensorflow as tf
from transformers import DistilBertConfig, TFDistilBertModel

distil_bert = 'distilbert-base-uncased'
# Note: this optimizer is defined but never used; compile() below uses 'adam'.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0000001)

config = DistilBertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config=config)

input_ids_in = tf.keras.layers.Input(shape=(25,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(25,), name='masked_token', dtype='int32')

embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs=X)

# Freeze the first three layers (the two inputs and the DistilBERT backbone).
for layer in model.layers[:3]:
    layer.trainable = False

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])

Finally, I run the model using the previously tokenized input_ids and input_masks as inputs, and I get a NaN loss after the first epoch:

model.fit(x=[input_ids, input_masks], y = d_y, epochs=3)

Epoch 1/3
20/20 [==============================] - 4s 182ms/step - loss: 0.9714 - sparse_categorical_accuracy: 0.6153
Epoch 2/3
20/20 [==============================] - 0s 19ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
Epoch 3/3
20/20 [==============================] - 0s 20ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
<tensorflow.python.keras.callbacks.History at 0x7fee0e220f60>

EDIT: The model computes a loss on the first epoch, but it starts returning NaNs at the second epoch. What could be causing that?
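
For reference, this is the kind of sanity check that can be run on the labels before training (a minimal sketch; it assumes only the d_y Series from above):

# Sanity-check the labels before training (sketch; assumes d_y above).
print('unique labels:', sorted(d_y.unique()))     # expect [0, 1, 2, 3]
assert d_y.between(0, 3).all(), 'found a label outside [0, 3]'
assert not d_y.isna().any(), 'found a missing label'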

Does anyone have any ideas about what I am doing wrong? All suggestions are welcome!

asked Feb 01 '26 by beginner

1 Answer

The problem occurs because num_labels was not specified when loading the model.

At the final output layer, by default K = 1 (the number of labels), and the output is the softmax $\sigma(\vec{z})_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}$, which with a single logit is identically 1.
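
To see the effect numerically, here is a tiny sketch using plain NumPy (nothing model-specific is assumed): with a single logit the softmax is always 1.0, so there is no valid probability to look up for any label other than 0.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.7])))                   # K = 1 -> always [1.]
print(softmax(np.array([2.7, -1.3, 0.4, 0.1])))   # K = 4 -> a real distribution

Once a label indexes past the end of that one-element distribution, the sparse categorical cross-entropy has nothing well-defined to compute, which is how the loss can degenerate to NaN.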

So while fine-tuning, we need to pass num_labels when doing multi-class classification:

from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)
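
Applied to the question's DistilBERT setup, a minimal sketch could look like this (it assumes the input_ids, input_masks, and d_y defined in the question; the Adam learning rate is an arbitrary placeholder, not a tuned value):

import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification

# Let the sequence-classification head match the 4 classes in the data.
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=4)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    # the classification head returns logits, hence from_logits=True
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['sparse_categorical_accuracy'],
)

model.fit(x={'input_ids': input_ids, 'attention_mask': input_masks},
          y=d_y.values, epochs=3)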
answered Feb 03 '26 by Rhea Thomas


