I am using the Hugging Face TFBertModel
to do a classification task. I am using the bare TFBertModel
with an added dense head layer, and not TFBertForSequenceClassification,
since I didn't see how I could use the latter with pretrained weights to fine-tune only the head.
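(For reference, a minimal sketch, not from the original post, of how head-only training could also be done with TFBertForSequenceClassification, assuming a transformers version that exposes the encoder as model.bert:

from transformers import TFBertForSequenceClassification

clf = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')
# Freeze the pretrained encoder; only the classification head stays trainable.
clf.bert.trainable = False
)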
As far as I know, fine-tuning should give me about 80% or more accuracy with both BERT and ALBERT, but I am not coming even near that number:
Train on 3600 samples, validate on 400 samples
Epoch 1/2
3600/3600 [==============================] - 177s 49ms/sample - loss: 0.6531 - accuracy: 0.5792 - val_loss: 0.5296 - val_accuracy: 0.7675
Epoch 2/2
3600/3600 [==============================] - 172s 48ms/sample - loss: 0.6288 - accuracy: 0.6119 - val_loss: 0.5020 - val_accuracy: 0.7850
More epochs don't make much difference.
I am using the public CoLA dataset for fine-tuning; this is what the data looks like:
gj04 1 Our friends won't buy this analysis, let alone the next one we propose.
gj04 1 One more pseudo generalization and I'm giving up.
gj04 1 One more pseudo generalization or I'm giving up.
gj04 1 The more we study verbs, the crazier they get.
...
And this is the code that loads the data into Python:
import csv

def get_cola_data(max_items=None):
    # Load sentences and acceptability labels from the CoLA TSV file.
    with open('cola_public/raw/in_domain_train.tsv') as csv_file:
        reader = csv.reader(csv_file, delimiter='\t')
        x = []
        y = []
        for row in reader:
            x.append(row[3])         # column 3: the sentence
            y.append(float(row[1]))  # column 1: the acceptability label (0/1)
    if max_items is not None:
        x = x[:max_items]
        y = y[:max_items]
    return x, y
I verified that the data ends up in the lists in the format I want, and this is the code of the model itself:
#!/usr/bin/env python

import tensorflow as tf
from tensorflow import keras
from transformers import BertTokenizer, TFBertModel
import numpy as np

from cola_public import get_cola_data


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')
bert_model.trainable = False

x_input = keras.Input(shape=(512,), dtype=tf.int64)
x_mask = keras.Input(shape=(512,), dtype=tf.int64)

# TFBertModel returns (sequence_output, pooled_output); use the pooled [CLS] output
_, output = bert_model([x_input, x_mask])
output = keras.layers.Dense(1)(output)

model = keras.Model(
    inputs=[x_input, x_mask],
    outputs=output,
    name='bert_classifier',
)

model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy'],
)

train_data_x, train_data_y = get_cola_data(max_items=4000)

# Tokenize and pad every sentence to the model's maximum length (512)
encoded_data = [
    tokenizer.encode_plus(data, add_special_tokens=True, pad_to_max_length=True)
    for data in train_data_x
]

train_data_x = np.array([data['input_ids'] for data in encoded_data])
mask_data_x = np.array([data['attention_mask'] for data in encoded_data])
train_data_y = np.array(train_data_y)

model.fit(
    [train_data_x, mask_data_x],
    train_data_y,
    epochs=2,
    validation_split=0.1,
)

# Interactive loop for trying out predictions
cmd_input = ''
while True:
    print("Type an opinion: ")
    cmd_input = input()
    # print('Your opinion is: %s' % cmd_input)
    if cmd_input == 'exit':
        break
    cmd_input_tokens = tokenizer.encode_plus(cmd_input, add_special_tokens=True, pad_to_max_length=True)
    cmd_input_ids = np.array([cmd_input_tokens['input_ids']])
    cmd_mask = np.array([cmd_input_tokens['attention_mask']])
    model.reset_states()
    result = model.predict([cmd_input_ids, cmd_mask])
    print(result)
Now, no matter whether I use another dataset, take a different number of items from the dataset, add a dropout layer before the last dense layer, insert another dense layer with a higher number of units before the last one, or use ALBERT instead of BERT, I always get low accuracy and high loss, and often the validation accuracy is higher than the training accuracy.
I get the same results if I try to use BERT/ALBERT for an NER task; the outcome is always the same, which makes me believe I am systematically making some fundamental mistake in fine-tuning.
I know that I have bert_model.trainable = False,
and that is what I want, since I want to train only the head and not the pretrained weights, and I know that people train that way successfully. Even if I also train the pretrained weights, the results are much worse.
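(As a sanity check, not in the original script: the freeze can be verified from the compiled model's summary. With the base frozen, only the final dense layer's 769 parameters, 768 weights plus 1 bias, should show up as trainable.

model.summary()  # expect Trainable params: 769 (the dense head only)
)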
I can see that I am badly underfitting, but I just can't put my finger on where I could improve, especially since people tend to have good results with just a single dense layer on top of the model.
The default learning rate is too high for BERT. Try setting it to one of the learning rates recommended in Appendix A.3 of the original paper: 5e-5, 3e-5 or 2e-5.
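For example, a minimal change to the compile step from the question (keeping the Adam optimizer):

model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=3e-5),  # or 5e-5 / 2e-5
    metrics=['accuracy'],
)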