I am trying to re-train a word2vec model in Keras 2 with the Tensorflow backend, using pretrained embeddings and a custom corpus.
This is how I initialize the embeddings layer with pretrained embeddings:
embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1, name='embedding',
                      embeddings_initializer=lambda x: pretrained_embeddings)
where pretrained_embeddings is a big matrix of size vocab_size x embedding_dim.
This works as long as pretrained_embeddings is not too big.
Unfortunately, in my case it is too big: vocab_size=2270872 and embedding_dim=300.
Upon initializing the Embeddings layer I get the error:
Cannot create a tensor proto whose content is larger than 2GB.
The error comes from the function add_weight() in /opt/r/anaconda3/lib/python3.6/site-packages/keras/engine/base_layer.py, more specifically from the following line:
weight = K.variable(initializer(shape),
                    dtype=dtype,
                    name=name,
                    constraint=constraint)
initializer is the lambda function from above, which returns the big matrix. shape is (2270872, 300), as already mentioned.
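For scale, a rough size estimate shows why a matrix of this shape runs into the limit. This is only a minimal sketch: it assumes the values are stored as float32 (the default Keras floatx) and reads the "2GB" in the error message as 2**31 bytes. The names bytes_per_value and proto_limit_bytes are introduced here for illustration and do not come from Keras or Tensorflow.

```python
# Rough size estimate for the embedding matrix from the question.
vocab_size = 2270872
embedding_dim = 300
bytes_per_value = 4  # assumption: float32, the default Keras floatx

matrix_bytes = vocab_size * embedding_dim * bytes_per_value
proto_limit_bytes = 2 ** 31  # assumption: "2GB" in the error means 2**31 bytes

print(matrix_bytes)                      # 2725046400
print(matrix_bytes > proto_limit_bytes)  # True
```

Under these assumptions the matrix needs roughly 2.7 GB, which is above the limit named in the error message.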
Is it possible to solve this issue without having to go to low-level Tensorflow programming? If I switch to Theano as a backend the code runs fine, but I'd like to use Tensorflow for its better long-term prospects.
The only similar Stackoverflow question I found was this, which proposes placeholder variables, but I am not sure how I can apply them on the level of Keras.
Thanks a lot
Edit: I am more than willing to work around this issue on the level of the Tensorflow backend. It's just that I don't know how to combine Tensorflow and Keras code in the same application in this case. Most examples use either one or the other, not both.
For example, what use are the Tensorflow placeholder variables when the initialization of the Embeddings layer in Keras will inevitably invoke the add_weight() function, which causes the issue?
Solution:
As hinted in @blue-phoenox's comment, I rewrote the code like this:
embedding = Embedding(vocab_size, embedding_dim,
                      input_length=1,
                      name='embedding')
embedding.build(input_shape=(1,))  # the input_shape here has no effect in the build function
embedding.set_weights([pretrained_embeddings])
That did it. Thanks again @blue-phoenox.
Instead of using the embeddings_initializer argument of the Embedding layer, you can load pre-trained weights for your embedding layer using the weights argument. This way you should be able to hand over pre-trained embeddings larger than 2GB.
Here is a short example:
from keras.layers import Embedding
embedding_layer = Embedding(vocab_size,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
Where embedding_matrix is just a regular numpy matrix containing your weights.
For more examples you can also take a look here:
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
Edit:
As @PavlinMavrodiev correctly pointed out (see end of question), the weights argument is deprecated. He used the layer method set_weights to set the weights instead:
layer.set_weights(weights): sets the weights of the layer from a list of Numpy arrays (with the same shapes as the output of get_weights).
To get the trained weights, get_weights can be used:
layer.get_weights(): returns the weights of the layer as a list of Numpy arrays.
Both are methods of the Keras Layer base class and can be used for all Keras layers, including the Embedding layer.