
How to deal with a large (>2GB) embedding lookup table in TensorFlow?

When using pre-trained word vectors for classification with an LSTM, how do I handle an embedding lookup table larger than 2 GB in TensorFlow?

I tried to build the embedding lookup table as in the code below,

data = tf.nn.embedding_lookup(vector_array, input_data)

but got this ValueError:

ValueError: Cannot create a tensor proto whose content is larger than 2GB

The variable vector_array in the code above is a NumPy array; it contains about 14 million unique tokens with a 100-dimensional word vector for each.
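Below is a minimal sketch of what is presumably happening, with hypothetical placeholder data of the shape described above; the NumPy array gets converted into a tf.constant that must be serialised into the graph definition, and a single tensor proto there cannot exceed 2 GB:

import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the real vectors: ~14M tokens x 100 dims of float32
# is roughly 14e6 * 100 * 4 bytes ≈ 5.6 GB, well over the 2 GB proto limit.
vector_array = np.zeros((14000000, 100), dtype=np.float32)
input_data = tf.placeholder(tf.int32, shape=[None, None])

# embedding_lookup converts the NumPy array into a graph constant -> ValueError
data = tf.nn.embedding_lookup(vector_array, input_data)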

Thank you for your help.

asked Feb 04 '23 by shinys


1 Answer

You need to copy it into a tf.Variable. There's a great answer to this question on Stack Overflow: Using a pre-trained word embedding (word2vec or GloVe) in TensorFlow

This is how I did it:

# Non-trainable variable that will hold the pre-trained vectors
embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]),
                                trainable=False, name="embedding_weights")
# Feed the matrix through a placeholder so it is never baked into the graph proto
embedding_placeholder = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
embedding_init = embedding_weights.assign(embedding_placeholder)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})

You can then use the embedding_weights variable to perform the lookup (remember to store the word-to-index mapping), as sketched below.
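A minimal lookup sketch, assuming a hypothetical word_to_index dict built when the vectors were loaded (its order must match the rows of embedding_matrix) and reusing sess and embedding_weights from the snippet above:

word_to_index = {"the": 0, "cat": 1, "sat": 2}   # hypothetical; ~14M entries in practice
input_data = tf.placeholder(tf.int32, shape=[None, None], name="input_data")

# Look up rows of the initialised variable rather than the raw NumPy array,
# so the huge matrix never has to be serialised into the graph definition.
embedded = tf.nn.embedding_lookup(embedding_weights, input_data)

ids = [[word_to_index[w] for w in ["the", "cat", "sat"]]]
vectors = sess.run(embedded, feed_dict={input_data: ids})
print(vectors.shape)   # (1, 3, EMBEDDING_DIM)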

Update: Using the variable is not required, but it lets you save the embeddings in a checkpoint for future use so that you don't have to redo the whole thing (loading very large embeddings takes a while on my laptop). If that isn't important, you can simply use placeholders, as Niklas Schnelle suggested.
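For completeness, a minimal sketch of that placeholder-only variant, assuming the same embedding_matrix, input_data and ids as above; the full matrix is fed on every run and nothing is stored in a checkpoint:

embedding_ph = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
embedded = tf.nn.embedding_lookup(embedding_ph, input_data)

with tf.Session() as sess:
    vectors = sess.run(embedded,
                       feed_dict={embedding_ph: embedding_matrix,  # full matrix fed each run
                                  input_data: ids})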

answered Feb 08 '23 by ltt