I am attempting to fine-tune a BERT model from TensorFlow Hub on Google Colab, following this link.
However, I run into the following error:
InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2047) arg_shape.handle_type != DT_INVALID input edge: [id=2693 model_preprocessing_67660:0 -> cluster_train_function:628]
The error occurs when I call my model.fit(...) function.
It only happens when I try to use the TPU (the model trains fine on CPU, but training takes a very long time).
Here is my code for setting up the TPU and model:
TPU Setup:
import os
import tensorflow as tf

os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"

# Connect to the Colab TPU and create a TPU distribution strategy
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.TPUStrategy(cluster_resolver)
Model Setup:
import tensorflow_hub as hub

def build_classifier_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessing_layer = hub.KerasLayer(
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
    encoder_inputs = preprocessing_layer(text_input)
    encoder = hub.KerasLayer(
        'https://tfhub.dev/google/experts/bert/wiki_books/sst2/2',
        trainable=True, name='BERT_encoder')
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
    return tf.keras.Model(text_input, net)
Model Training:
from official.nlp import optimization  # from the tf-models-official package

with strategy.scope():
    bert_model = build_classifier_model()
    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    metrics = tf.metrics.BinaryAccuracy()

    epochs = 1
    steps_per_epoch = 1280000
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(0.1 * num_train_steps)

    init_lr = 3e-5
    optimizer = optimization.create_optimizer(init_lr=init_lr,
                                              num_train_steps=num_train_steps,
                                              num_warmup_steps=num_warmup_steps,
                                              optimizer_type='adamw')

    bert_model.compile(optimizer=optimizer,
                       loss=loss,
                       metrics=metrics)

    print(f'Training model')
    history = bert_model.fit(x=X_train, y=y_train,
                             validation_data=(X_val, y_val),
                             epochs=epochs)
Note that X_train is a NumPy array of type str with shape (1280000,) and y_train is a NumPy array of shape (1280000, 1).
I don't know exactly what changes you have made to the code, and I don't have details about your dataset. But I can see that you are trying to train on the whole dataset in one epoch while passing steps_per_epoch directly. I would recommend writing it like this:
Set batch_size to a power of two (for example 16 or 32); if you don't want to batch the dataset, just set batch_size to 1.
batch_size = 16
steps_per_epoch = training_data_size // batch_size
The problem with the code is most probably the training dataset size: you are hard-coding it instead of deriving it from the data.
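For example, a rough sketch of what I mean (assuming X_train, y_train, X_val, y_val, bert_model and epochs are defined as in your question):

batch_size = 32  # any power of two that fits in memory

# Take the size from the data itself rather than typing 1280000 by hand
training_data_size = X_train.shape[0]
steps_per_epoch = training_data_size // batch_size

history = bert_model.fit(x=X_train, y=y_train,
                         validation_data=(X_val, y_val),
                         batch_size=batch_size,
                         steps_per_epoch=steps_per_epoch,
                         epochs=epochs)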
If you're loading the dataset from TFDS, use (as shown in the link):
train_dataset, train_data_size = load_dataset_from_tfds(
    in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)
If you're using a custom dataset, store the size of the cleaned dataset in a variable and then use that variable wherever the training data size is needed. Avoid hard-coding values as far as possible.
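A minimal sketch for the custom-dataset case (the names here are just illustrative, and I'm assuming the same X_train / y_train arrays from your question): wrap the arrays in a tf.data.Dataset so the size, batching and steps per epoch all come from one place.

import tensorflow as tf

training_data_size = len(X_train)  # size of the cleaned dataset, not a hard-coded number
batch_size = 32
steps_per_epoch = training_data_size // batch_size

# Build a batched pipeline from the in-memory arrays
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(10_000)
            .batch(batch_size, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE))

history = bert_model.fit(train_ds,
                         steps_per_epoch=steps_per_epoch,
                         epochs=epochs)

drop_remainder=True keeps every batch the same size, which TPUs generally prefer because they compile for static shapes.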