How to add new embeddings for unknown words in Tensorflow (training & pre-set for testing)

Tags: python, tensorflow, nlp

I would like to know how to add a normally distributed random 300-dimensional vector (element type tf.float32) whenever a word outside the pre-trained vocabulary is encountered. I am using pre-trained GloVe word embeddings, but I sometimes encounter unknown words, and I want to create a random normal word vector for each such word.

The problem is that, in my current setup, I use tf.contrib.lookup.index_table_from_tensor to convert words to integers based on the known vocabulary. This function can hash out-of-vocabulary words into a predefined number of extra buckets, but my embed matrix will not contain an embedding for those new hash values. I am not sure whether I can simply append a randomized embedding to the end of the embed list.

I would also like to do this efficiently, so a pre-built TensorFlow function, or a method built from TensorFlow functions, would probably be best. I define special tokens in advance, such as an end-of-sentence token and a default unknown token as the empty string ("" at index 0), but a single unknown token has limited power to learn distinct representations for different unknown words. I currently use tf.nn.embedding_lookup() as the final embedding step.

I would like to be able to add new random 300d vectors for each unknown word in the training data, and I would also like to add pre-made random word vectors for any unknown tokens not seen in training that may be encountered during testing. What is the most efficient way of doing this?

def embed_tensor(string_tensor, trainable=True):
    """
    Convert a list of strings into a list of indices, then into 300d vectors
    """
    # ordered lists of vocab and corresponding (by index) 300d vector
    vocab, embed = load_pretrained_glove()

    # Set up tensorflow look up from string word to unique integer
    vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
        mapping=tf.constant(vocab),
        default_value = 0)
    string_tensor = vocab_lookup.lookup(string_tensor)

    # define the word embedding 
    embedding_init = tf.Variable(tf.constant(np.asarray(embed),
                                 dtype=tf.float32),
                                 trainable=trainable,
                                 name="embed_init")

    # return the word embedded version of the sentence (300d vectors/word)
    return tf.nn.embedding_lookup(embedding_init, string_tensor)

asked Jul 15 '17 00:07

prijatelj

2 Answers

The code example below adapts your embed_tensor function such that words are embedded as follows:

1. For words that have a pretrained embedding, the embedding is initialized with the pretrained embedding. The embedding can be kept fixed during training if trainable is False.
2. For words in the training data that don't have a pretrained embedding, the embedding is initialized randomly. The embedding can be kept fixed during training if trainable is False.
3. For words in the test data that don't occur in the training data and don't have a pretrained embedding, a single randomly initialized embedding vector is used. This vector can't be trained.

import tensorflow as tf
import numpy as np

EMB_DIM = 300
def load_pretrained_glove():
    return ["a", "cat", "sat", "on", "the", "mat"], np.random.rand(6, EMB_DIM)

def get_train_vocab():
    return ["a", "dog", "sat", "on", "the", "mat"]

def embed_tensor(string_tensor, trainable=True):
  """
  Convert List of strings into list of indices then into 300d vectors
  """
  # ordered lists of vocab and corresponding (by index) 300d vector
  pretrained_vocab, pretrained_embs = load_pretrained_glove()
  train_vocab = get_train_vocab()
  # sorted() keeps the row order deterministic across runs (plain set order is not)
  only_in_train = sorted(set(train_vocab) - set(pretrained_vocab))
  vocab = pretrained_vocab + only_in_train

  # Set up tensorflow look up from string word to unique integer
  vocab_lookup = tf.contrib.lookup.index_table_from_tensor(
    mapping=tf.constant(vocab),
    default_value=len(vocab))
  string_tensor = vocab_lookup.lookup(string_tensor)

  # define the word embedding
  pretrained_embs = tf.get_variable(
      name="embs_pretrained",
      initializer=tf.constant_initializer(np.asarray(pretrained_embs), dtype=tf.float32),
      shape=pretrained_embs.shape,
      trainable=trainable)
  train_embeddings = tf.get_variable(
      name="embs_only_in_train",
      shape=[len(only_in_train), EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=trainable)
  unk_embedding = tf.get_variable(
      name="unk_embedding",
      shape=[1, EMB_DIM],
      initializer=tf.random_uniform_initializer(-0.04, 0.04),
      trainable=False)

  embeddings = tf.concat([pretrained_embs, train_embeddings, unk_embedding], axis=0)

  return tf.nn.embedding_lookup(embeddings, string_tensor)
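To make the row layout of the concatenated matrix concrete, here is a minimal pure-Python sketch of the same index assignment, without TensorFlow. The helper name lookup_indices and the test word "sofa" are made up for illustration; the dictionary lookup is only a stand-in for what index_table_from_tensor does with default_value=len(vocab).

```python
# Toy vocabularies copied from the example above.
pretrained_vocab = ["a", "cat", "sat", "on", "the", "mat"]
train_vocab = ["a", "dog", "sat", "on", "the", "mat"]

# Words seen in training that have no pretrained vector get their own rows,
# placed after the pretrained rows. sorted() keeps the order deterministic.
only_in_train = sorted(set(train_vocab) - set(pretrained_vocab))
vocab = pretrained_vocab + only_in_train

# Everything else falls through to one shared row at the very end.
unk_index = len(vocab)
word_to_index = {word: i for i, word in enumerate(vocab)}


def lookup_indices(words):
    """Hypothetical stand-in for vocab_lookup.lookup on a list of strings."""
    return [word_to_index.get(word, unk_index) for word in words]


indices = lookup_indices(["the", "dog", "sat", "on", "the", "sofa"])
```

Here "dog" lands on the first row after the pretrained block, and "sofa", which appears in neither vocabulary, lands on the shared unknown row.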

FYI, to have a sensible, non-random representation for words that don't occur in the training data and don't have a pretrained embedding, you could consider mapping words with a low frequency in your training data to an unk token (that is not in your vocabulary) and make the unk_embedding trainable. This way you learn a prototype for words that are unseen in the training data.
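One possible way to do that preprocessing is sketched below in plain Python. The token string "<unk>", the function name replace_rare_words, and the min_count threshold are all assumptions chosen for illustration, not part of the code above.

```python
from collections import Counter

# Hypothetical placeholder; any string that is absent from the vocabulary works,
# because the lookup table then maps it to the default (unknown) index.
UNK = "<unk>"


def replace_rare_words(sentences, min_count=2):
    """Replace every word seen fewer than min_count times with UNK."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word if counts[word] >= min_count else UNK for word in sentence]
        for sentence in sentences
    ]


train = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
processed = replace_rare_words(train, min_count=2)
```

The training vocabulary would then be built from the processed sentences with the placeholder left out, and unk_embedding would be created with trainable=True so that the shared row receives gradient updates from the rare words.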

answered Oct 05 '22 12:10

GeertH

I have not tried this myself, but here is a possible approach that uses the same machinery as your code. I will give it more thought later.

The index_table_from_tensor method accepts a num_oov_buckets parameter that hashes all your OOV words into a predefined number of buckets.

If you set this parameter to a sufficiently large value, your OOV words will spread across these buckets (each bucket has an ID greater than the ID of the last in-vocabulary word).

So, if you

1. set (i.e. assign) the last rows of your embedding_init Variable (those corresponding to the buckets) to random values at each lookup, and
2. make num_oov_buckets large enough that collisions are minimized,

you can obtain behavior that approximates what you are asking for, in a very efficient way.
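The idea can be sketched in plain Python as follows. This is only an illustration under several assumptions: SHA-256 stands in for TensorFlow's internal string hash (which is a different function), the bucket rows are initialized once rather than reassigned at each lookup, and the names word_to_index, NUM_OOV_BUCKETS, and EMB_DIM are made up for the example.

```python
import hashlib
import random

NUM_OOV_BUCKETS = 100  # assumed value; larger means fewer collisions
EMB_DIM = 300

vocab = ["a", "cat", "sat", "on", "the", "mat"]
vocab_index = {word: i for i, word in enumerate(vocab)}


def word_to_index(word):
    """Return the vocabulary ID, or a hashed bucket ID for an OOV word."""
    if word in vocab_index:
        return vocab_index[word]
    # SHA-256 is an assumption for illustration, not TensorFlow's hash.
    digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
    return len(vocab) + int(digest, 16) % NUM_OOV_BUCKETS


rng = random.Random(0)
# Stand-in embedding matrix: one row per known word (these would be the
# GloVe vectors in practice) followed by one random normal row per bucket.
embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]
    for _ in range(len(vocab) + NUM_OOV_BUCKETS)
]

dog_vector = embeddings[word_to_index("dog")]
```

Because the bucket ID depends only on the string, the same OOV word maps to the same row in training and in testing, as long as the bucket rows themselves are saved with the model.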

This behavior can be justified by the same reasoning used for hash tables: if the number of buckets is large enough, the string hashing will assign each OOV word to a different bucket with high probability (i.e. collisions in the same bucket are minimized). Since you are assigning a different random vector to each bucket, you obtain an (almost always) distinct mapping for each OOV word.
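To get a rough feel for how large "large enough" is, the following sketch computes the probability that a given number of distinct OOV words all land in different buckets. It assumes the hash spreads words uniformly and independently over the buckets, which is an idealization; the function name prob_no_collision is made up for the example.

```python
def prob_no_collision(num_words, num_buckets):
    """Probability that num_words distinct words occupy distinct buckets,
    assuming each word is hashed uniformly at random (birthday problem)."""
    probability = 1.0
    for already_placed in range(num_words):
        # The next word must avoid every bucket that is already occupied.
        probability *= 1.0 - already_placed / num_buckets
    return probability


# Example call: 50 distinct OOV words spread over 100,000 buckets.
p_all_distinct = prob_no_collision(50, 100_000)
```

Under this assumption the number of buckets has to grow roughly with the square of the number of OOV words to keep collisions unlikely, so the memory cost of the extra embedding rows is the main trade-off.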

answered Oct 05 '22 11:10

Giuseppe Marra
