My setup has an NVIDIA P100 GPU. I am working on a Google BERT model to answer questions. I am using the SQuAD question-answering dataset, which gives me questions and the paragraphs from which the answers should be drawn. My research indicates this architecture should be OK, but I keep getting OutOfMemory errors during training:
ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Below, please find a full program that uses someone else's implementation of Google's BERT algorithm inside my own model. Please let me know what I can do to fix my error. Thank you!
import json
import numpy as np
import pandas as pd
import os
assert os.path.isfile("train-v1.1.json"),"Non-existent file"
from tensorflow.python.client import device_lib
import tensorflow.compat.v1 as tf
#import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# Layers, Model and ModelCheckpoint used when building the answer network below
from tensorflow.keras.layers import Input, Lambda, Flatten, Concatenate, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import re
regex = re.compile(r'\W+')
#Reading the files.
def readFile(filename):
    with open(filename) as file:
        fields = []
        JSON = json.loads(file.read())
        articles = []
        for article in JSON["data"]:
            articleTitle = article["title"]
            article_body = []
            for paragraph in article["paragraphs"]:
                paragraphContext = paragraph["context"]
                article_body.append(paragraphContext)
                for qas in paragraph["qas"]:
                    question = qas["question"]
                    answer = qas["answers"][0]
                    fields.append({"question":question,"answer_text":answer["text"],"answer_start":answer["answer_start"],"paragraph_context":paragraphContext,"article_title":articleTitle})
            article_body = "\\n".join(article_body)
            article = {"title":articleTitle,"body":article_body}
            articles.append(article)
        fields = pd.DataFrame(fields)
        fields["question"] = fields["question"].str.replace(regex," ")
        assert not (fields["question"].str.contains("catalanswhat").any())
        fields["paragraph_context"] = fields["paragraph_context"].str.replace(regex," ")
        fields["answer_text"] = fields["answer_text"].str.replace(regex," ")
        assert not (fields["paragraph_context"].str.contains("catalanswhat").any())
        fields["article_title"] = fields["article_title"].str.replace("_"," ")
        assert not (fields["article_title"].str.contains("catalanswhat").any())
        return fields,JSON["data"]
trainingData,training_JSON = readFile("train-v1.1.json")
print("JSON dataset read.")
#Text preprocessing
## Converting text to skipgrams
print("Tokenizing sentences.")
strings = trainingData.drop("answer_start",axis=1)
strings = strings.values.flatten()
answer_start_train_one_hot = pd.get_dummies(trainingData["answer_start"])
# @title Keras-BERT Environment
import os
pretrained_path = 'uncased_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')
# Use TF_Keras
os.environ["TF_KERAS"] = "1"
# @title Load Basic Model
import codecs
from keras_bert import load_trained_model_from_checkpoint
token_dict = {}
with codecs.open(vocab_path, 'r', 'utf8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)
model = load_trained_model_from_checkpoint(config_path, checkpoint_path)
#@title Model Summary
model.summary()
#@title Create tokenization stuff.
from keras_bert import Tokenizer
tokenizer = Tokenizer(token_dict)
def tokenize(text,max_len):
    tokenizer.tokenize(text)
    return tokenizer.encode(first=text,max_len=max_len)
def tokenize_array(texts,max_len=512):
    indices = np.zeros((texts.shape[0],max_len))
    segments = np.zeros((texts.shape[0],max_len))
    for i in range(texts.shape[0]):
        tokens = tokenize(texts[i],max_len)
        indices[i] = tokens[0]
        segments[i] = tokens[1]
    #print(indices.shape)
    #print(segments.shape)
    return np.stack([segments,indices],axis=1)
#@ Tokenize inputs.
def X_Y(dataset,answer_start_one_hot,batch_size=10):
    questions = dataset["question"]
    contexts = dataset["paragraph_context"]
    questions_tokenized = tokenize_array(questions.values)
    contexts_tokenized = tokenize_array(contexts.values)
    X = np.stack([questions_tokenized,contexts_tokenized],axis=1)
    Y = answer_start_one_hot
    return X,Y
def X_Y_generator(dataset,answer_start_one_hot,batch_size=10):
    while True:
        try:
            batch_indices = np.random.choice(np.arange(0,dataset.shape[0]),size=batch_size)
            dataset_batch = dataset.iloc[batch_indices]
            X,Y = X_Y(dataset_batch,answer_start_one_hot.iloc[batch_indices])
            max_int = pd.concat((trainingData["answer_start"],devData["answer_start"])).max()
            yield (X,Y)
        except Exception as e:
            print("Unhandled exception in X_Y_generator: ",e)
            raise
model.trainable = True
answers_network_checkpoint = ModelCheckpoint('answers_network-best.h5', verbose=1, monitor='val_loss',save_best_only=True, mode='auto')
input_layer = Input(shape=(2,2,512,))
print("input layer: ",input_layer.shape)
questions_input_layer = Lambda(lambda x: x[:,0])(input_layer)
context_input_layer = Lambda(lambda x: x[:,1])(input_layer)
print("questions input layer: ",questions_input_layer.shape)
print("context input layer: ",context_input_layer.shape)
questions_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(questions_input_layer)
print("questions indices layer: ",questions_indices_layer.shape)
questions_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(questions_input_layer)
print("questions segments layer: ",questions_segments_layer.shape)
context_indices_layer = Lambda(lambda x: tf.cast(x[:,0],tf.float64))(context_input_layer)
context_segments_layer = Lambda(lambda x: tf.cast(x[:,1],tf.float64))(context_input_layer)
questions_bert_layer = model([questions_indices_layer,questions_segments_layer])
print("Questions bert layer loaded.")
context_bert_layer = model([context_indices_layer,context_segments_layer])
print("Context bert layer loaded.")
questions_flattened = Flatten()(questions_bert_layer)
context_flattened = Flatten()(context_bert_layer)
combined = Concatenate()([questions_flattened,context_flattened])
#bert_dense_questions = Dense(256,activation="sigmoid")(questions_flattened)
#bert_dense_context = Dense(256,activation="sigmoid")(context_flattened)
answers_network_output = Dense(1604,activation="softmax")(combined)
#answers_network = Model(inputs=[input_layer],outputs=[questions_bert_layer,context_bert_layer])
answers_network = Model(inputs=[input_layer],outputs=[answers_network_output])
answers_network.summary()
answers_network.compile("adam","categorical_crossentropy",metrics=["accuracy"])
answers_network.fit_generator(
    X_Y_generator(
        trainingData,
        answer_start_train_one_hot,
        batch_size=10),
    steps_per_epoch=100,
    epochs=100,
    callbacks=[answers_network_checkpoint])
My vocabulary size is about 83,000 words. Any model with a "good" accuracy/F1 score is preferred, but I am also on a non-extensible deadline in 5 days.
EDIT:
Unfortunately, there was one thing I didn't mention: I am actually using CyberZHG's keras-bert module both for preprocessing and for the actual BERT model, so some optimizations may break the code. For example, I tried setting the default float type to float16, but this caused a compatibility error.
EDIT #2:
By request, here's the code for my full program:
Jupyter notebook
Edit: I have edited my response in place rather than increasing the length of the already long response.
After looking at the code, the issue arises from the final layer in your model, and I was able to get it to work with the following fixes/changes.
ResourceExhaustedError: OOM when allocating tensor with shape[786432,1604] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[{{node dense_3/kernel/Initializer/random_uniform/RandomUniform}}]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
So, looking at the error, the problem is that it cannot allocate an array of shape [786432, 1604]. If you do a simple calculation, that array alone takes about 5GB (assuming float32); if it is float64, that goes to 10GB. Add the parameters coming from BERT and the other layers in the model and, voilà, you run out of memory.
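As a quick sanity check on those numbers (786432 is exactly the two concatenated 512 x 768 BERT outputs after flattening), you can reproduce the arithmetic in plain Python:
rows, cols = 2 * 512 * 768, 1604      # rows == 786432: flattened question + context vectors
print(rows * cols * 4 / 1e9)          # float32 kernel: ~5.0 GB
print(rows * cols * 8 / 1e9)          # float64 kernel: ~10.1 GB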
The issues
Looking at the code, all the layers in your answer network are producing float64 because you are specifying float64 for all your Lambda layers. So my first suggestion is to set the backend float type to float16 globally,
tf.keras.backend.set_floatx('float16')
And as a precaution, give the inputs an explicit float16 dtype:
question_indices_layer = Input(shape=(256,), dtype='float16')
question_segments_layer = Input(shape=(256,), dtype='float16')
context_indices_layer = Input(shape=(256,), dtype='float16')
context_segments_layer = Input(shape=(256,), dtype='float16')
questions_bert_layer = model([question_indices_layer,question_segments_layer])
context_bert_layer = model([context_indices_layer,context_segments_layer])
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])
This keeps everything in float16.
The massive softmax layer
Another thing you can do is, instead of passing a massive [batch size, 512, 768] output to your dense layer, squash it using a smaller layer or some transformation. A few things you can try are:
Add a smaller Dense layer before the 1604-unit softmax layer. This reduces the model parameters significantly.
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu',dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64,activation="relu",dtype=tf.float16)(contexts_flattened)
combined = Concatenate(dtype=tf.float16)([questions_flattened,contexts_flattened])
Sum (or average) over the question output. Because you only care about understanding what the question is, it would be fine to lose positional information from that output. You can do this the following way (K is the Keras backend):
questions_flattened = Lambda(lambda x: K.sum(x, axis=1))(questions_bert_layer)
Instead of Concatenate try Add() so that you don't increase the dimensionality.
You can try any of these, optionally in combination with the others in the list. But make sure you match the dimensions of questions_flattened and contexts_flattened when combining them, as otherwise you'll get errors. A sketch of the Add() option follows below.
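For illustration, a minimal sketch of the Add() variant (the 64-unit Dense layers are just one way to give both branches the same size, which Add() requires; Add comes from tensorflow.keras.layers like the other layers used here):
questions_flattened = Flatten(dtype=tf.float16)(questions_bert_layer)
questions_flattened = Dense(64, activation='relu', dtype=tf.float16)(questions_flattened)
contexts_flattened = Flatten(dtype=tf.float16)(context_bert_layer)
contexts_flattened = Dense(64, activation='relu', dtype=tf.float16)(contexts_flattened)
# Add() keeps the combined vector at 64 dims, whereas Concatenate() would give 128
combined = Add(dtype=tf.float16)([questions_flattened, contexts_flattened])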
The next problem is that your input length is 512. I'm not sure how you arrived at that number, but I think you can do well below it. For example, you get the following statistics for questions and paragraphs.
count 175198.000000
mean 11.217582
std 3.597345
min 1.000000
25% 9.000000
50% 11.000000
75% 13.000000
max 41.000000
Name: question, dtype: float64
count 175198.000000
mean 123.791653
std 50.541241
min 21.000000
25% 92.000000
50% 114.000000
75% 147.000000
max 678.000000
Name: paragraph_context, dtype: float64
You can get this information as follows (and analogously for paragraph_context),
pd.Series(trainingData["question"]).str.split(' ').str.len().describe()
For example, when you pad your sequences using pad_sequences without specifying a maxlen, sentences are padded to the maximum length found in the corpus. So you would end up with 678-element-long paragraph contexts, even though 75% of the data is under 150 words long.
I'm not exactly sure how these values play into your choice of 512, but I hope you get my point. From the looks of it, it seems you could do fine with a length of 150.
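As an illustration (context_sequences here is a hypothetical list of tokenized contexts, not a variable from the code above), capping the length would look like:
from keras.preprocessing.sequence import pad_sequences
# context_sequences: hypothetical tokenized contexts; cap at 150 tokens instead of the corpus max of 678
context_sequences = pad_sequences(context_sequences, maxlen=150, padding='post', truncating='post')
Equivalently, since the code above encodes with keras-bert, you could pass a smaller max_len (e.g. 150) to tokenize_array instead of the default 512.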
You can also reduce the vocabulary.
A good way of deciding this number is to count the unique words that appear more than n times in your corpus (n could be 10-25, or better, do some further analysis and find an optimal value).
For example you can get vocabulary stats as follows.
counts = sorted([(k, v) for k, v in list(textTokenizer.word_counts.items())], key=lambda x: x[1])
This gives you (word, frequency) pairs sorted by frequency. You will see that around 37000 words appear fewer than (approximately) 10 times, so you can set the vocabulary size of the tokenizer to something smaller.
textTokenizer = Tokenizer(num_words=50000, oov_token='unk')
But keep in mind that word_index still contains all the words, so you need to make sure you remove these rare words when you pass it as token_dict.
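A minimal sketch of that filtering step (assuming textTokenizer has already been fit with fit_on_texts; keeping index < num_words mirrors what the Tokenizer itself does internally with num_words=50000):
# assumes textTokenizer.fit_on_texts(...) has been called on the corpus
num_words = 50000
token_dict = {word: index
              for word, index in textTokenizer.word_index.items()
              if index < num_words}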
You seem to be setting batch_size=10, which should be fine. But to get better results (and hopefully with more memory available once you apply the above suggestions), go for a higher batch size like 32 or 64, which will improve performance.