I want to use the BERT model to do multi-label classification with TensorFlow.
To do so, I want to adapt the example run_classifier.py from the BERT GitHub repository, which shows how to use BERT for simple classification using the pre-trained weights provided by Google Research (for example BERT-Base, Cased).
I have X different labels, each with a value of either 0 or 1, so I want to add a new dense layer of size X on top of the original BERT model and use sigmoid_cross_entropy_with_logits as the loss.
So, for the theoretical part, I think I am OK.
The problem is that I don't know how to append a new output layer and retrain only this new layer with my dataset, using the existing BertModel class.
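To make it a bit more concrete, this is roughly the kind of thing I imagine for training only the new layer (just a sketch of my idea, not working code, and I am not sure it fits with how run_classifier.py builds its optimizer in optimization.py):

# Sketch of my idea (untested): keep the pre-trained BERT weights frozen and
# only optimize the variables of the new output layer, selected by name.
new_layer_vars = [v for v in tf.trainable_variables()
                  if v.name.startswith(("output_weights", "output_bias"))]
train_op = tf.train.AdamOptimizer(learning_rate=2e-5).minimize(
    loss, var_list=new_layer_vars)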
Here is the original create_model() function from run_classifier.py, where I guess I have to make my modifications, but I am a bit lost on what to do.
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  output_layer = model.get_pooled_output()

  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)
And here is the same function with some of my modifications, but where things are missing (and possibly wrong too):
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids)

  output_layer = model.get_pooled_output()

  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)
There are other things I have adapted in the code, and with those I had no problems.
So, if anyone knows what I should do to resolve my problem, or can point out an obvious mistake I may have made, I would be glad to hear it.
You want to replace the softmax, which models a single distribution over the possible outputs (all scores sum to one), with a sigmoid, which models an independent distribution for each class (a yes/no distribution per output).
So, you correctly changed the loss function, but you also need to change how you compute the probabilities. It should be:
probabilities = tf.sigmoid(logits)
In this case, you don't need the log_probs.
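Putting it together, the loss part of your create_model() could look roughly like this (a sketch, not tested; note that sigmoid_cross_entropy_with_logits expects float labels of shape [batch_size, num_labels], so cast them if your input pipeline produces integers):

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)

    # Independent probability per label instead of a softmax distribution.
    probabilities = tf.sigmoid(logits)

    # Multi-label loss: one sigmoid cross-entropy term per label,
    # summed per example, then averaged over the batch.
    labels = tf.cast(labels, tf.float32)
    per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=logits)
    per_example_loss = tf.reduce_sum(per_example_loss, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)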