Keras initialize large embeddings layer with pretrained embeddings

I am trying to re-train a word2vec model in Keras 2 with the TensorFlow backend, using pretrained embeddings and a custom corpus.

This is how I initialize the embeddings layer with pretrained embeddings:

embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1, name='embedding',
                      # the initializer ignores the requested shape and simply
                      # returns the full pretrained matrix
                      embeddings_initializer=lambda x: pretrained_embeddings)

where pretrained_embeddings is a big matrix of size vocab_size x embedding_dim.
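
For context, a matrix like this can be built along the following lines (a simplified sketch, not my exact code; the gensim vectors file and the word_index word-to-row mapping are placeholders):

import numpy as np
from gensim.models import KeyedVectors

# load the pretrained vectors and copy them into a vocab_size x embedding_dim matrix
kv = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)
pretrained_embeddings = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
for word, idx in word_index.items():  # word_index maps each word to its row index
    if word in kv:
        pretrained_embeddings[idx] = kv[word]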

This works as long as pretrained_embeddings is not too big.

Unfortunately, in my case it is: vocab_size=2270872 and embedding_dim=300.

Upon initializing the Embedding layer I get the error:

Cannot create a tensor proto whose content is larger than 2GB.

The error comes from the function add_weight() in /opt/r/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py, more specifically the following line:

weight = K.variable(initializer(shape),
                    dtype=dtype,
                    name=name,
                    constraint=constraint)

initializer is the lambda function from above, which returns the big matrix. shape is (2270872, 300) as already mentioned.

Is it possible to solve this issue without having to resort to low-level TensorFlow programming? If I switch to Theano as the backend the code runs fine, but I'd like to use TensorFlow for its better long-term prospects.

The only similar Stack Overflow question I found was this, which proposes placeholder variables, but I am not sure how to apply them at the Keras level.

Thanks a lot

Edit: I am more than willing to work around this issue at the level of the TensorFlow backend. I just don't know how to combine TensorFlow and Keras code in the same application in this case. Most examples are either one or the other, not both.

For example, what use are TensorFlow placeholder variables when initializing the Embedding layer in Keras will inevitably invoke the add_weight() function, which causes the issue?
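
As far as I understand it, the placeholder trick in raw TensorFlow 1.x looks roughly like this (a sketch of the technique from the linked question, outside of Keras):

import tensorflow as tf

# feed the large matrix at initialization time instead of baking it into the graph
embedding_ph = tf.placeholder(tf.float32, shape=pretrained_embeddings.shape)
embedding_var = tf.Variable(embedding_ph)

with tf.Session() as sess:
    sess.run(embedding_var.initializer,
             feed_dict={embedding_ph: pretrained_embeddings})

But I don't see where to hook this in, since Keras creates the variable itself inside add_weight().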

Solution:

As hinted at in @blue-phoenox's comment, I rewrote the code like this:

embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1, 
                      name='embedding')
embedding.build(input_shape=(1,)) # the input_shape here has no effect in the build function
embedding.set_weights([pretrained_embeddings])

That did it. Thanks again @blue-phoenox.
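
For completeness, here is the workaround in a minimal model context (the Sequential wrapper is just for illustration, not my actual model):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1, name='embedding')
model.add(embedding)  # adding the layer builds it, so set_weights works now
embedding.set_weights([pretrained_embeddings])

# sanity check: the weights were loaded without hitting the 2GB proto limit
assert np.allclose(embedding.get_weights()[0], pretrained_embeddings)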

asked Nov 21 '18 by Pavlin Mavrodiev
1 Answer

Instead of using the embeddings_initializer argument of the Embedding layer, you can load pre-trained weights for your embedding layer using the weights argument. This way you should be able to hand over pre-trained embeddings larger than 2GB.

Here is a short example:

from keras.layers import Embedding

embedding_layer = Embedding(vocab_size,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

Where embedding_matrix is just a regular NumPy matrix containing your weights.
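
For illustration, embedding_matrix can be built from a pretrained vectors file roughly like this (a sketch along the lines of the blog post linked below; the GloVe file name and the word_index dict from a Tokenizer are placeholders):

import numpy as np

# read the pretrained vectors into a dict: word -> vector
embeddings_index = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# copy them into a vocab_size x EMBEDDING_DIM matrix, one row per word index
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector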

For further examples you can also take a look here:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html


Edit:

As @PavlinMavrodiev (see end of question) correctly pointed out, the weights argument is deprecated. He used the layer method set_weights to set the weights instead:

  • layer.set_weights(weights): sets the weights of the layer from a list of Numpy arrays (with the same shapes as the output of get_weights).

To get the trained weights, get_weights can be used:

  • layer.get_weights(): returns the weights of the layer as a list of Numpy arrays.

Both are methods of the Keras Layer base class and can be used with all Keras layers, including the Embedding layer.
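
A small illustrative example of the two methods together (layer sizes here are arbitrary):

import numpy as np
from keras.layers import Embedding

layer = Embedding(input_dim=1000, output_dim=64)
layer.build(input_shape=(None,))     # build the layer so the weight matrix exists

new_weights = np.random.rand(1000, 64)
layer.set_weights([new_weights])     # list of arrays matching get_weights() shapes
print(layer.get_weights()[0].shape)  # (1000, 64)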

answered by MBT