I have trained word2vec in gensim. In Keras, I want to use it to make matrix of sentence using that word embedding. As storing the matrix of all the sentences is very space and memory inefficient. So, I want to make embedding layer in Keras to achieve this so that It can be used in further layers(LSTM). Can you tell me in detail how to do this?
PS: It is different from other questions because I am using gensim for word2vec training instead of keras.
Gensim Python Library Most notably for this tutorial, it supports an implementation of the Word2Vec word embedding for learning new word vectors from text. It also provides tools for loading pre-trained word embeddings in a few formats and for making use and querying a loaded embedding.
Embedding layer is one of the available layers in Keras. This is mainly used in Natural Language Processing related applications such as language modeling, but it can also be used with other tasks that involve neural networks. While dealing with NLP problems, we can use pre-trained word embeddings such as GloVe.
Google's Word2vec Pretrained Word Embedding Word2Vec is one of the most popular pretrained word embeddings developed by Google. Word2Vec is trained on the Google News dataset (about 100 billion words).
Let's say you have following data that you need to encode
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
You must then tokenize it using the Tokenizer
from Keras like this and find the vocab_size
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
You can then enocde it to sequences like this
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
You can then pad the sequences so that all the sequences are of a fixed length
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
Then use the word2vec model to make embedding matrix
# load embedding as a dict
def load_embedding(filename):
# load embedding into memory, skip first line
file = open(filename,'r')
lines = file.readlines()[1:]
file.close()
# create a map of words to vectors
embedding = dict()
for line in lines:
parts = line.split()
# key is string word, value is numpy array for vector
embedding[parts[0]] = asarray(parts[1:], dtype='float32')
return embedding
# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
# total vocabulary size plus 0 for unknown words
vocab_size = len(vocab) + 1
# define weight matrix dimensions with all 0
weight_matrix = zeros((vocab_size, 100))
# step vocab, store vectors using the Tokenizer's integer mapping
for word, i in vocab.items():
weight_matrix[i] = embedding.get(word)
return weight_matrix
# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
Once you have the embedding matrix you can use it in Embedding
layer like this
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
This layer can be used in making a model like this
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
All the codes are adapted from this awesome blog post. follow it to know more about Embeddings using Glove
For using word2vec see this post
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With