I earlier posted this question. I wanted to get embedding similar to this youtube video, time 33 minutes onward.
1) I dont think that the embedding that i am getting from CLS
token are similar to what is shown in the youtube video. I tried to perform semantic similarity and got horrible results. Could someone confirm whether embedding that i am getting are similar to embedding mentioned at 35.27 mark of the video?
2) If the answer of the above question is 'not similar' then how could i get the embedding that I am looking for using the code that i have written?
3) If the answer of the 1st question is 'they are similar' then why am i getting horrible results? do i need to finetune using more data?
The code that i used to fine tune is below. It comes from this page. Few changes were made to that code to return CLS
embedding. Those changes were based upon answers given to my question
train_InputExamples = train2.apply(lambda x: run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
text_a = x[DATA_COLUMN],
text_b = None,
label = x[LABEL_COLUMN]), axis = 1)
"""
test_InputExamples = test2.apply(lambda x: run_classifier.InputExample(guid=None,
text_a = x[DATA_COLUMN],
text_b = None,
label = x[LABEL_COLUMN]), axis = 1)
"""
# In[17]:
# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
# In[18]:
#Create tokenizer function using local albert model hub
def create_tokenizer_from_hub_module():
"""Get the vocab file and casing info from the Hub module."""
with tf.Graph().as_default():
bert_module = hub.Module(BERT_MODEL_HUB)
tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
with tf.Session() as sess:
vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
tokenization_info["do_lower_case"]])
return tokenization.FullTokenizer(
vocab_file=vocab_file, do_lower_case=do_lower_case, spm_model_file=vocab_file)
tokenizer = create_tokenizer_from_hub_module()
#Test tokenizer on a sample sentence
tokenizer.tokenize("This here's an example of using the ALBERT tokenizer")
# In[19]:
# We'll set sequences to be at most 128 tokens long.
MAX_SEQ_LENGTH = 512
# Convert our train and test features to InputFeatures that BERT understands.
train_features = run_classifier.convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
"""
test_features = run_classifier.convert_examples_to_features(test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
"""
# In[20]:
# `create_model` builds a model. First, it loads the BERT tf hub module again (this time to extract the computation graph).
#Next, it creates a single new layer that will be trained to adapt BERT to our task
#(i.e. classifying text). This strategy of using a mostly trained model is called [fine-tuning](http://wiki.fast.ai/index.php/Fine_tuning).
def create_model(is_predicting, input_ids, input_mask, segment_ids, labels,
num_labels):
"""Creates a classification model."""
bert_module = hub.Module(
BERT_MODEL_HUB,
trainable=True)
bert_inputs = dict(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids)
bert_outputs = bert_module(
inputs=bert_inputs,
signature="tokens",
as_dict=True)
# Use "pooled_output" for classification tasks on an entire sentence.
# Use "sequence_outputs" for token-level output.
output_layer = bert_outputs["pooled_output"]
pooled_output = output_layer#added 25March
hidden_size = output_layer.shape[-1].value
# Create our own layer to tune for politeness data.
output_weights = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer())
with tf.variable_scope("loss"):
# Dropout helps prevent overfitting
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
log_probs = tf.nn.log_softmax(logits, axis=-1)
probs = tf.nn.softmax(logits, axis=-1)#added 25March
# Convert labels into one-hot encoding
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
# If we're predicting, we want predicted labels and the probabiltiies.
if is_predicting:
return (predicted_labels, log_probs, probs, pooled_output)
# If we're train/eval, compute loss between predicted and actual label
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
#return (loss, predicted_labels, log_probs)
return (loss, predicted_labels, log_probs, probs, pooled_output)#added 25March
# In[ ]:
# In[21]:
# Next we'll wrap our model function in a `model_fn_builder` function that adapts our model to work for training, evaluation, and prediction.
# In[14]:
# model_fn_builder actually creates our model function
# using the passed parameters for num_labels, learning_rate, etc.
def model_fn_builder(num_labels, learning_rate, num_train_steps,
num_warmup_steps):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)
# TRAIN and EVAL
if not is_predicting:
"""
(loss, predicted_labels, log_probs) = create_model(
is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
"""
# this should be changed in both places
(loss, predicted_labels, log_probs, probs, pooled_output) = create_model(
is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
train_op = optimization.create_optimizer(
loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)
# Calculate evaluation metrics.
def metric_fn(label_ids, predicted_labels):
accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
f1_score = tf.contrib.metrics.f1_score(
label_ids,
predicted_labels)
auc = tf.metrics.auc(
label_ids,
predicted_labels)
recall = tf.metrics.recall(
label_ids,
predicted_labels)
precision = tf.metrics.precision(
label_ids,
predicted_labels)
true_pos = tf.metrics.true_positives(
label_ids,
predicted_labels)
true_neg = tf.metrics.true_negatives(
label_ids,
predicted_labels)
false_pos = tf.metrics.false_positives(
label_ids,
predicted_labels)
false_neg = tf.metrics.false_negatives(
label_ids,
predicted_labels)
return {
"eval_accuracy": accuracy,
"f1_score": f1_score,
"auc": auc,
"precision": precision,
"recall": recall,
"true_positives": true_pos,
"true_negatives": true_neg,
"false_positives": false_pos,
"false_negatives": false_neg
}
eval_metrics = metric_fn(label_ids, predicted_labels)
if mode == tf.estimator.ModeKeys.TRAIN:
return tf.estimator.EstimatorSpec(mode=mode,
loss=loss,
train_op=train_op)
else:
return tf.estimator.EstimatorSpec(mode=mode,
loss=loss,
eval_metric_ops=eval_metrics)
else:
#(predicted_labels, log_probs) = create_model(is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
(predicted_labels, log_probs, probs, pooled_output)=create_model(is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)
# return dictionary of all the values you wanted
predictions = {'log_probabilities': log_probs,'probabilities': probs,'labels': predicted_labels,'pooled_output': pooled_output}
"""
predictions = {
'probabilities': log_probs,
'labels': predicted_labels
}
"""
return tf.estimator.EstimatorSpec(mode, predictions=predictions)
# Return the actual model function in the closure
return model_fn
# In[22]:
# In[15]:
# Compute train and warmup steps from batch size
# These hyperparameters are copied from this colab notebook (https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)
BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 2.0
# Warmup is a period of time where hte learning rate
# is small and gradually increases--usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 500
SAVE_SUMMARY_STEPS = 100
# In[23]:
# In[16]:
# Compute # train and warmup steps from batch size
num_train_steps = int((len(train_features) / BATCH_SIZE) * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
#epochs = steps * batch_size * worker_gpu / training_subwords
#effecive batch size is batch_size * worker_gpu
# In[17]:
# Specify outpit directory and number of checkpoint steps to save
run_config = tf.estimator.RunConfig(
model_dir=OUTPUT_DIR,
save_summary_steps=SAVE_SUMMARY_STEPS,
save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS)
# In[18]:
model_fn = model_fn_builder(
num_labels=len(label_list),
learning_rate=LEARNING_RATE,
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config,
params={"batch_size": BATCH_SIZE})
# Next we create an input builder function that takes our training feature set (`train_features`) and produces a generator. This is a pretty standard design pattern for working with Tensorflow [Estimators](https://www.tensorflow.org/guide/estimators).
# In[24]:
# In[19]:
# Create an input function for training. drop_remainder = True for using TPUs.
train_input_fn = run_classifier.input_fn_builder(
features=train_features,
seq_length=MAX_SEQ_LENGTH,
is_training=True,
drop_remainder=False)
# ### Model Training
# In[46]:
print(f'Beginning Training!')
current_time = datetime.now()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print("Training took time ", datetime.now() - current_time)
"""
# ### Model Testing
# In[47]:
test_input_fn = run_classifier.input_fn_builder(
features=test_features,
seq_length=MAX_SEQ_LENGTH,
is_training=False,
drop_remainder=False)
# In[48]:
estimator.evaluate(input_fn=test_input_fn, steps=None)
"""
# In[25]:
# ### Prediction
# In[24]:
def getPrediction(in_sentences):
labels = ["Negative", "Positive"]
input_examples = [run_classifier.InputExample(guid="", text_a = x, text_b = None, label = 0) for x in in_sentences] # here, "" is just a dummy label
input_features = run_classifier.convert_examples_to_features(input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
predict_input_fn = run_classifier.input_fn_builder(features=input_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
predictions = estimator.predict(predict_input_fn)
#return predictions
return [(sentence, prediction['log_probabilities'],prediction['probabilities'], labels[prediction['labels']],prediction['pooled_output']) for sentence, prediction in zip(in_sentences, predictions)]
# In[25]:
pred_sentences = [
"They sold me something I didn't want",
]
The code in the video that gets CLS
embedding is as below
# Put the model in evaluation mode--the dropout layers behave differently
# during evaluation.
model.eval()
with torch.no_grad():
# Forward pass, return hidden states and predictions.
# This will return the logits rather than the loss because we have
# not provided labels.
logits, encoded_layers = model(
input_ids = input_ids,
token_type_ids = None,
attention_mask = attn_mask)
# Retrieve our sentence embedding--take the `[CLS]` embedding from the final
# layer.
layer_i = 12 # The last BERT layer before the classifier.
batch_i = 0 # Only one input in the batch.
token_i = 0 # The first token, corresponding to [CLS]
# Grab the embedding.
vec = encoded_layers[layer_i][batch_i][token_i]
High-performance semantic similarity with BERT A big part of NLP relies on similarity in highly-dimensional spaces. Typically an NLP solution will take some text, process it to create a big vector/array representing said text — then perform several transformations. It's highly-dimensional magic.
While a vanilla BERT can be used for encoding sentences, the embeddings generated with it are not robust. As we can see below, the samples deemed similar by the model are often more lexically than semantically related. Small perturbations in input samples result in large changes of predicted similarity.
An important note here is that BERT is not trained for semantic sentence similarity directly like the Universal Sentence Encoder or InferSent models. Therefore, BERT embeddings cannot be used directly to apply cosine distance to measure similarity.
Google's BERT model consist of 12 layers of Transformer Encoders with 12 heads of attention each, and every layer embedding size (or hidden size) is 768. Hence it's label in TF hub: bert_uncased_L-12_H-768_A-12
. Uncased is to indicate that BERT is case insensitive i.e. every word is lower cased before processing.
Your output of the last layer is 512 (MAX_SEQ_LENGTH
) by 768 (hidden_size). First vector (index zero) corresponds to [CLS
]. That is what you get from bert_outputs["pooled_output"]
. So you do getting output "similar" to the one you intend (in case your batch_size
=1, if it is set to other value you just drop information for all sentences but first one).
layer_i = 12 # The last BERT layer before the classifier.
batch_i = 0 # Only one input in the batch.
token_i = 0 # The first token, corresponding to [CLS]
There might be many answers to your question "why the results are horrible". But it seems to me that it's in the fine tuning process. On top of BERT you add a simple NN, which is called "head", that is trained do you downstream task. In your case you optimize the whole network (BERT and the top head) to solve sentiment analysis task. After that you try to use features used as input to the head to get answer for the different task - semantic similarity. While it is possible to get somehow useful features for the semantic similarity, these (features) are optimized for differentiating sentiment and might not very useful for other tasks. And I did not see anything in your code indicating some kind of adjustments to the new task.
So what you need (IMO) to do is to
Based on your code, just to demonstrate how to use embeddings as in the video:
import scipy
for i in range(len(predictions)):
print(i, pred_sentences[i])
print()
for i in range(len(predictions)):
for j in range(i+1, len(predictions)):
print (f'{i}:{j} >> {scipy.spatial.distance.cosine(predictions[i][-1],predictions[j][-1])}')
Will provide following output:
0 That movie was absolutely fantastic.
1 This film is creative and surprising.
2 Ford is an American multinational automaker that has its main headquarters in Dearborn, Michigan, a suburb of Detroit.
3 The Volkswagen Group with its headquarters in Wolfsburg, Germany is one of the world's leading manufacturers of automobiles and commercial vehicles.
0:1 >> 0.021687865257263184
0:2 >> 0.3452081084251404
0:3 >> 0.2836960554122925
1:2 >> 0.3700438141822815
1:3 >> 0.3061264753341675
2:3 >> 0.01616525650024414
As you can see, sentences 0 and 1 are much closer to each other than to 2 and 3, as expected. And 2 and 3 is similar between them and are more distant to both 0 an 1.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With