I am curious as to how I can add a normal-randomized 300-dimensional vector (elements of type tf.float32) whenever a word unknown to the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but in some cases I encounter unknown words, and I want to create a normal-randomized word vector for each newly found unknown word.
The problem is that with my current setup, I use tf.contrib.lookup.index_table_from_tensor to convert from words to integers based on the known vocabulary. This function can create new tokens and hash them for some predefined number of out-of-vocabulary (OOV) words, but my embed will not contain an embedding for this new unknown hash value. I am uncertain whether I can simply append a randomized embedding to the end of the embed list.
I also would like to do this in an efficient way, so a pre-built TensorFlow function or a method involving TensorFlow functions would probably be the most efficient. I define pre-known special tokens such as an end-of-sentence token and a default unknown token as the empty string ("" at index 0), but this is limited in its power to learn for various different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.
I would like to be able to add new random 300d vectors for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens not seen in training that are possibly encountered during testing. What is the most efficient way of doing this?
import numpy as np
import tensorflow as tf

def embed_tensor(string_tensor, trainable=True):
    """
    Convert a list of strings into a list of indices, then into 300d vectors.
    """
    # ordered lists of vocab and corresponding (by index) 300d vectors
    vocab, embed = load_pretrained_glove()

    # Set up a TensorFlow lookup from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=0)
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding
    embedding_init = tf.Variable(tf.constant(np.asarray(embed),
                                             dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")

    # return the word-embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)
Word embeddings work by using an algorithm to train a set of fixed-length, dense, continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space, and these points are learned and moved around based on the words that surround the target word.
Word embeddings are created using a neural network with one input layer, one hidden layer, and one output layer. The computer does not understand that the words king, prince, and man are closer together in a semantic sense than the words queen, princess, and daughter. All it sees are characters encoded in binary.
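To make the "points in an embedding space" idea concrete, here is a minimal sketch with made-up toy vectors (not real GloVe values); closeness between words is measured with cosine similarity, which is how neighbors in embedding spaces are usually compared:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; closer to 1.0 means more similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4d vectors standing in for real 300d GloVe embeddings (values are made up).
king = np.array([0.9, 0.8, 0.1, 0.0])
prince = np.array([0.8, 0.7, 0.2, 0.1])
queen = np.array([0.1, 0.2, 0.9, 0.8])

print(cosine_similarity(king, prince))  # high: nearby points in the space
print(cosine_similarity(king, queen))   # lower: farther apart in this toy space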
The code example below adapts your embed_tensor function such that words are embedded as follows:

- For words that have a pretrained embedding, the embedding is initialized with the pretrained vector. The embedding can be kept fixed during training if trainable is False.
- For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
- For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.
import tensorflow as tf
import numpy as np

EMB_DIM = 300

def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)

def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]

def embed_tensor(string_tensor, trainable=True):
    """
    Convert a list of strings into a list of indices, then into 300d vectors.
    """
    # ordered lists of vocab and corresponding (by index) 300d vectors
    pretrained_vocab, pretrained_embs = load_pretrained_glove()
    train_vocab = get_train_vocab()
    only_in_train = list(set(train_vocab) - set(pretrained_vocab))
    vocab = pretrained_vocab + only_in_train

    # Set up a TensorFlow lookup from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value=len(vocab))
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding
    pretrained_embs = tf.get_variable(
        name="embs_pretrained",
        initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
        shape=pretrained_embs.shape,
        trainable=trainable)
    train_embeddings = tf.get_variable(
        name="embs_only_in_train",
        shape=[len(only_in_train), EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=trainable)
    unk_embedding = tf.get_variable(
        name="unk_embedding",
        shape=[1, EMB_DIM],
        initializer=tf.random_uniform_initializer(-0.04, 0.04),
        trainable=False)

    embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)

    return tf.nn.embedding_lookup(embeddings, string_tensor)
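A minimal usage sketch of the function above, assuming TF 1.x graph mode (the example sentence is made up): the lookup table created by index_table_from_tensor needs tf.tables_initializer() in addition to the usual variable initializer.

# "dog" is only in the training vocab; "zebra" falls back to the unknown embedding.
words = tf.constant(["a", "dog", "sat", "on", "the", "zebra"])
embedded = embed_tensor(words)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())  # initializes the string-to-index table
    print(sess.run(embedded).shape)    # (6, 300)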
FYI, to have a sensible, non-random representation for words that don't occur in the training data and don't have a pretrained embedding, you could consider mapping words with a low frequency in your training data to an unk token (that is not in your vocabulary) and making the unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
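As a rough sketch of that idea (the helper name, the min_count threshold, and the "<unk>" string are illustrative assumptions, not part of the original code), low-frequency words could be rewritten to the unk token before the vocabulary is built, and unk_embedding declared with trainable=True:

from collections import Counter

def replace_rare_words(tokenized_sentences, min_count=2, unk_token="<unk>"):
    # Count word frequencies over the whole training corpus.
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    # Words seen fewer than min_count times are mapped to the unk token, so the
    # model can learn a prototype embedding for rare and unseen words.
    return [[word if counts[word] >= min_count else unk_token for word in sentence]
            for sentence in tokenized_sentences]

train = [["a", "dog", "sat"], ["a", "dog", "ran"], ["the", "mat"]]
print(replace_rare_words(train))
# [['a', 'dog', '<unk>'], ['a', 'dog', '<unk>'], ['<unk>', '<unk>']]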
I have never tried it, but I can suggest a possible approach using the same machinery as your code; I will think about it more later.
The index_table_from_tensor method accepts a num_oov_buckets parameter that hashes all your out-of-vocabulary (OOV) words into a predefined number of buckets.
If you set this parameter to a "large enough" value, you will see your data spread among these buckets (each bucket has an ID greater than the ID of the last in-vocabulary word).
So, if you:

- assign the last rows of your embedding_init Variable (those corresponding to the buckets) to random values, and
- choose a num_oov_buckets large enough that collisions will be minimized,

you can obtain a behavior that is (an approximation of) what you are asking for in a very efficient way.
The random behavior can be justified by a theory similar to the one behind hash tables: if the number of buckets is large enough, the string hashing method will assign each OOV word to a different bucket with high probability (i.e., minimizing collisions). Since you are assigning a different random value to each bucket, you obtain an (almost) unique mapping for each OOV word.
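Here is a rough sketch of that bucket-based approach (the bucket count, the uniform initializer range, and the embed_tensor_oov name are illustrative assumptions; load_pretrained_glove is the loader from the snippets above): random rows are appended to the embedding matrix, one per OOV bucket, so every hashed OOV ID resolves to its own, mostly unique, random vector.

import tensorflow as tf
import numpy as np

EMB_DIM = 300
NUM_OOV_BUCKETS = 1000  # "large enough" so hash collisions between OOV words are rare

vocab, embed = load_pretrained_glove()

# OOV words hash to IDs in [len(vocab), len(vocab) + NUM_OOV_BUCKETS).
vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab),
    num_oov_buckets=NUM_OOV_BUCKETS)

# Pretrained rows followed by one randomly initialized row per OOV bucket.
pretrained_rows = tf.Variable(
    tf.constant(np.asarray(embed), dtype=tf.float32),
    trainable=True, name="embed_pretrained")
oov_rows = tf.Variable(
    tf.random_uniform([NUM_OOV_BUCKETS, EMB_DIM], -0.04, 0.04),
    trainable=False, name="embed_oov_buckets")
embedding_init = tf.concat([pretrained_rows, oov_rows], axis=0)

def embed_tensor_oov(string_tensor):
    ids = vocab_lookup.lookup(string_tensor)
    return tf.nn.embedding_lookup(embedding_init, ids)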