When I use pre-trained word vectors for classification with an LSTM, I wondered how to deal with an embedding lookup table larger than 2 GB in TensorFlow.
To do this, I tried to create the embedding lookup table as in the code below,
data = tf.nn.embedding_lookup(vector_array, input_data)
and got this ValueError:
ValueError: Cannot create a tensor proto whose content is larger than 2GB
The variable vector_array in the code above is a NumPy array; it contains about 14 million unique tokens, with a 100-dimensional word vector for each word.
Thank you for your help.
You need to copy it into a tf.Variable. There's a great answer to this question on Stack Overflow: Using a pre-trained word embedding (word2vec or Glove) in TensorFlow
This is how I did it:
import tensorflow as tf

# Non-trainable variable to hold the embeddings; the large matrix is fed
# in through a placeholder to bypass the 2 GB limit on graph constants.
embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]), trainable=False, name="embedding_weights")
embedding_placeholder = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
embedding_init = embedding_weights.assign(embedding_placeholder)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})
You can then use the embedding_weights variable to perform the lookup (remember to store the word-to-index mapping).
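For example, here's a minimal sketch of the lookup itself, assuming input_data is a [batch_size, sequence_length] tensor of token indices produced by your word-to-index mapping (the names here are illustrative, not from the original code):

# Each int32 index selects the corresponding row of embedding_weights.
input_data = tf.placeholder(tf.int32, [None, None])
embedded = tf.nn.embedding_lookup(embedding_weights, input_data)  # -> [batch, seq_len, EMBEDDING_DIM]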
Update: Using a variable is not required, but it allows you to save the embeddings for future use so you don't have to redo the whole thing (loading very large embeddings takes a while on my laptop). If that's not important, you can simply use placeholders, as Niklas Schnelle suggested.
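For reference, a minimal sketch of that placeholder-only variant, assuming you are willing to feed the full embedding_matrix on every session.run call (batch_indices is an illustrative name for a batch of token indices):

# No variable: the matrix is fed straight into the lookup on each run,
# so nothing is stored in the graph or saved in checkpoints.
embedding_ph = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
input_data = tf.placeholder(tf.int32, [None, None])
embedded = tf.nn.embedding_lookup(embedding_ph, input_data)
sess.run(embedded, feed_dict={embedding_ph: embedding_matrix, input_data: batch_indices})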