InternalError when using TPU for training Keras model

I am attempting to fine-tune a BERT model from TensorFlow Hub on Google Colab, following this link.

However, I run into the following error:

InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2047) arg_shape.handle_type != DT_INVALID  input edge: [id=2693 model_preprocessing_67660:0 -> cluster_train_function:628]

The error is raised when I call my model.fit(...) function.

This error only occurs when I try to use a TPU (training runs fine on a CPU, but takes a very long time).

Here is my code for setting up the TPU and model:

TPU Setup:

import os
import tensorflow as tf

# Load TF Hub models uncompressed so the TPU workers can read them
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"

# Connect to the Colab TPU and create a distribution strategy
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.TPUStrategy(cluster_resolver)
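
A quick sanity check (an addition, not from the original post) is to list the TPU cores before building the model:

# Should print eight logical TPU devices on a standard Colab TPU
print("TPU devices:", tf.config.list_logical_devices('TPU'))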

Model Setup:

import tensorflow_hub as hub

def build_classifier_model():
  # Raw string input; the TF Hub preprocessing layer handles tokenization
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer('https://tfhub.dev/google/experts/bert/wiki_books/sst2/2', trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  # Pooled [CLS] representation feeding a single-logit classification head
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)
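
A minimal smoke test (also an addition, assuming it is run eagerly on the CPU before connecting to the TPU) verifies that the model builds and emits one logit per example:

# Build the model and push a single raw string through it
test_model = build_classifier_model()
print(test_model(tf.constant(['such a great movie!'])))  # expect shape (1, 1)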

Model Training:

# Assuming 'optimization' comes from the TF Model Garden (pip install tf-models-official)
from official.nlp import optimization

with strategy.scope():

  bert_model = build_classifier_model()
  loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
  metrics = tf.metrics.BinaryAccuracy()
  epochs = 1
  steps_per_epoch = 1280000
  num_train_steps = steps_per_epoch * epochs
  num_warmup_steps = int(0.1 * num_train_steps)

  init_lr = 3e-5
  optimizer = optimization.create_optimizer(init_lr=init_lr,
                                            num_train_steps=num_train_steps,
                                            num_warmup_steps=num_warmup_steps,
                                            optimizer_type='adamw')
  bert_model.compile(optimizer=optimizer,
                     loss=loss,
                     metrics=metrics)
  print('Training model')
  history = bert_model.fit(x=X_train, y=y_train,
                           validation_data=(X_val, y_val),
                           epochs=epochs)

Note that X_train is a NumPy array of strings with shape (1280000,), and y_train is a NumPy array of shape (1280000, 1).
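
For reference, a common way to feed arrays like these to a TPU (a sketch, not from the original post) is a batched tf.data.Dataset with drop_remainder=True, since TPUs need fixed batch shapes:

# Sketch: fixed-shape batches; the ragged final batch is dropped
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(10000)
            .batch(128, drop_remainder=True)
            .prefetch(tf.data.AUTOTUNE))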

asked Dec 21 '25 by a_002311

1 Answer

I don't know exactly what changes you have made to the code, and I don't have details about your dataset. But I can see that you are trying to train the whole dataset in one epoch and are passing steps_per_epoch in directly. I would recommend writing it like this instead.

Set batch_size to a power of two (for example 16 or 32); if you don't want to batch the dataset, just set batch_size to 1:

batch_size = 16
steps_per_epoch = training_data_size // batch_size  # derive steps from the data

The problem with your code is most probably the training-dataset size: you're making a mistake by entering the dataset size manually as steps_per_epoch instead of computing the step count from it.

If you're loading the dataset from tfds, use (as shown in the link):

train_dataset, train_data_size = load_dataset_from_tfds(
  in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)

If you're using a custom dataset, store the size of the cleaned dataset in a variable and use that variable wherever the training-data size is needed; see the sketch below. Avoid hard-coding values as far as possible.
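
A minimal sketch of that advice (my assumption, reusing X_train, y_train, epochs, and the compiled bert_model from the question):

# Derive every size from the data instead of typing it in
batch_size = 16
training_data_size = len(X_train)                   # 1280000 in the question
steps_per_epoch = training_data_size // batch_size  # feeds the LR schedule

num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1 * num_train_steps)

history = bert_model.fit(x=X_train, y=y_train,
                         batch_size=batch_size,
                         validation_data=(X_val, y_val),
                         epochs=epochs)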

answered Dec 22 '25 by Chinmay


