
How to add new embeddings for unknown words in Tensorflow (training & pre-set for testing)

I would like to know how to add a normally distributed random 300-dimensional vector (element type tf.float32) whenever a word unknown to the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but in some cases I encounter unknown words, and I want to create a random word vector for each newly found unknown word.

The problem is that with my current setup, I use tf.contrib.lookup.index_table_from_tensor to convert words to integers based on the known vocabulary. This function can create new tokens and hash them into a predefined number of out-of-vocabulary buckets, but my embed matrix will not contain an embedding for the new hash value. I am not sure whether I can simply append a randomized embedding to the end of the embed list.

I would also like to do this efficiently, so a pre-built TensorFlow function, or a method built from TensorFlow functions, would probably be best. I define pre-known special tokens such as an end-of-sentence token and a default unknown token, the empty string ("" at index 0), but a single unknown token is limited in its power to learn for many different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.

I would like to add a new random 300d vector for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens that were not seen in training but may be encountered during testing. What is the most efficient way of doing this?

import numpy as np
import tensorflow as tf

def embed_tensor(string_tensor, trainable=True):
    """
    Convert a list of strings into a list of indices, then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    vocab, embed = load_pretrained_glove()

    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value = 0)
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding 
    embedding_init = tf.Variable(tf.constant(np.asarray(embed),
                                 dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")

    # return the word embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)
prijatelj asked Jul 15 '17 00:07




2 Answers

The code example below adapts your embed_tensor function such that words are embedded as follows:

  • For words that have a pretrained embedding, the embedding is initialized with the pretrained embedding. The embedding can be kept fixed during training if trainable is False.
  • For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
  • For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.
import tensorflow as tf
import numpy as np

EMB_DIM = 300
def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)

def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]

def embed_tensor(string_tensor, trainable=True):
  """
  Convert List of strings into list of indices then into 300d vectors
  """
  # ordered lists of vocab and corresponding (by index) 300d vector
  pretrained_vocab, pretrained_embs = load_pretrained_glove()
  train_vocab = get_train_vocab()
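  # sorted() keeps the row order stable across runs, so saved embeddings still line up at test time
  only_in_train = sorted(set(train_vocab) - set(pretrained_vocab))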
  vocab = pretrained_vocab + only_in_train

  # Set up tensorflow look up from string word to unique integer
  vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab),
    default_value=len(vocab))
  string_tensor = vocab_lookup.lookup(string_tensor)

  # define the word embedding
  pretrained_embs = tf.get_variable(
      name="embs_pretrained",
      initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
      shape=pretrained_embs.shape,
      trainable=trainable)
  train_embeddings = tf.get_variable(
      name="embs_only_in_train",
      shape=[len(only_in_train), EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=trainable)
  unk_embedding = tf.get_variable(
      name="unk_embedding",
      shape=[1, EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=False)

  embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)

  return tf.nn.embedding_lookup(embeddings, string_tensor)
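To make the resulting row layout easier to see, here is a minimal pure-Python sketch of the index assignment only (no TensorFlow). The helper names build_vocab and word_to_row are hypothetical and not part of the answer's code, and sorting the train-only words is an assumption made here so the order is reproducible.

```python
def build_vocab(pretrained_vocab, train_vocab):
    """Pretrained words first, then words that only occur in the training data."""
    # sorted() is an assumption of this sketch: it makes the order reproducible.
    only_in_train = sorted(set(train_vocab) - set(pretrained_vocab))
    return list(pretrained_vocab) + only_in_train, only_in_train


def word_to_row(word, vocab):
    """Mimic the lookup table: known words get their index, all others get len(vocab)."""
    index = {w: i for i, w in enumerate(vocab)}
    return index.get(word, len(vocab))


pretrained = ["a", "cat", "sat", "on", "the", "mat"]
train = ["a", "dog", "sat", "on", "the", "mat"]
vocab, only_in_train = build_vocab(pretrained, train)

rows = {w: word_to_row(w, vocab) for w in ["cat", "dog", "bird"]}
```

With the toy vocabularies above, "cat" lands in the pretrained block, "dog" in the train-only block right after it, and the unseen word "bird" falls through to the single shared row at index len(vocab), which corresponds to unk_embedding in the concatenated matrix.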

FYI, to have a sensible, non-random representation for words that don't occur in the training data and don't have a pretrained embedding, you could consider mapping words with a low frequency in your training data to an unk token (that is not in your vocabulary) and make the unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
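
The low-frequency mapping could be done as a preprocessing step before the strings reach the lookup table. The following is only a sketch: the token string "<unk>", the threshold min_count and the function name replace_rare_words are all illustrative assumptions, not something prescribed above.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical token; it must NOT be added to the vocabulary


def replace_rare_words(sentences, min_count=2, unk_token=UNK):
    """Replace every word seen fewer than min_count times with unk_token."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] >= min_count else unk_token for word in sentence]
        for sentence in sentences
    ]


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
replaced = replace_rare_words(sentences, min_count=2)
```

Because the unk token is absent from the vocabulary, the lookup table sends it to the default index, so the rare training words share the unk_embedding row and provide gradient signal for it when that variable is trainable.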

GeertH answered Oct 05 '22 12:10


I have never tried this, but here is a possible approach that reuses the same machinery as your code. I will give it more thought later.

The index_table_from_tensor method accepts a num_oov_buckets parameter that hashes all your out-of-vocabulary (OOV) words into a predefined number of buckets.

If you set this parameter to a large enough value, you will see your data spread among these buckets (each bucket has an ID greater than the ID of the last in-vocabulary word).

So,

  • if (at each lookup) you assign random values to the last rows of your embedding_init Variable (those corresponding to the buckets), and
  • if you make num_oov_buckets large enough that collisions are minimized,

you can obtain a behavior that is (an approximation of) what you are asking in a very efficient way.
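
As a rough illustration of the idea (not of TensorFlow's actual implementation), the sketch below mimics the bucket IDs and the extra random rows in plain Python. zlib.crc32 is only a stand-in for TensorFlow's internal string hash, the rows are initialized once rather than at every lookup, and the function names and the toy 4-dimensional table are assumptions.

```python
import random
import zlib


def oov_bucket_id(word, vocab_size, num_oov_buckets):
    """Map an unknown word to an ID in [vocab_size, vocab_size + num_oov_buckets)."""
    # crc32 is a stand-in for TensorFlow's own string hash (assumption).
    return vocab_size + zlib.crc32(word.encode("utf-8")) % num_oov_buckets


def lookup_id(word, vocab_index, num_oov_buckets):
    """Return the in-vocabulary index, or a hashed bucket ID for unknown words."""
    if word in vocab_index:
        return vocab_index[word]
    return oov_bucket_id(word, len(vocab_index), num_oov_buckets)


def extend_with_random_rows(embed, num_oov_buckets, dim, seed=0):
    """Append one normally distributed row per OOV bucket to the embedding table."""
    rng = random.Random(seed)
    extra = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_oov_buckets)]
    return embed + extra


vocab = ["", "a", "cat", "sat"]
vocab_index = {w: i for i, w in enumerate(vocab)}
embed = [[0.0] * 4 for _ in vocab]  # toy 4-d stand-in for the 300-d GloVe rows
table = extend_with_random_rows(embed, num_oov_buckets=8, dim=4)
zebra_id = lookup_id("zebra", vocab_index, num_oov_buckets=8)
```

Every ID the lookup can return then has a row in the table, and the same unknown word always maps to the same row because the hash is deterministic.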

The random behavior can be justified by the same reasoning used for hash tables: if the number of buckets is large enough, the string hashing will assign each OOV word to a different bucket with high probability (i.e. with few collisions in the same bucket). Since you assign a different random vector to each bucket, you obtain an (almost always) distinct mapping for each OOV word.
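
To get a feel for how large "large enough" is, the probability that every OOV word gets its own bucket can be estimated with the usual birthday-problem product. This sketch assumes the hash spreads words uniformly and independently over the buckets, which is an idealization; the function name is hypothetical.

```python
def prob_all_distinct(num_oov_words, num_oov_buckets):
    """Probability that num_oov_words uniformly hashed words hit pairwise distinct buckets."""
    p = 1.0
    for i in range(num_oov_words):
        # The (i+1)-th word must avoid the i buckets that are already taken.
        p *= max(0.0, 1.0 - i / num_oov_buckets)
    return p


# Example: 100 distinct unknown words with different bucket counts.
for buckets in (1_000, 100_000, 10_000_000):
    print(buckets, prob_all_distinct(100, buckets))
```

Under this assumption the number of buckets has to be much larger than the square of the number of distinct OOV words before collisions become unlikely, so some sharing of random vectors should be expected with modest bucket counts.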

Giuseppe Marra answered Oct 05 '22 11:10