
Using pretrained gensim Word2vec embedding in Keras

I have trained word2vec in gensim. In Keras, I want to use it to build a matrix for each sentence from that word embedding, but storing the matrices for all sentences explicitly would be very space- and memory-inefficient. So I want to create an Embedding layer in Keras to achieve this, so that it can be used in further layers (LSTM). Can you tell me in detail how to do this?

PS: This is different from other questions because I am using gensim for word2vec training instead of Keras.

shivank01 asked Sep 01 '18



1 Answer

Let's say you have the following data that you need to encode:

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

You then tokenize it using the Tokenizer from Keras and find the vocab_size:

from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(docs)
# +1 because index 0 is reserved for padding
vocab_size = len(t.word_index) + 1
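
If you want to inspect which integer each word was assigned (useful for sanity-checking the weight matrix later), the fitted Tokenizer exposes the mapping directly:

print(t.word_index)
# e.g. {'work': 1, 'done': 2, ...} - the exact integers depend on word frequency in docs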

You can then encode it to sequences like this:

encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

You can then pad the sequences so that they all have a fixed length:

from keras.preprocessing.sequence import pad_sequences

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
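
The loader in the next step expects the embedding to be stored in the plain-text word2vec format (a header line, then one word and its vector per line). Since you trained your model with gensim, here is a minimal sketch of exporting it to that format, assuming your trained model is in a variable called model:

# `model` is assumed to be your trained gensim Word2Vec model
model.wv.save_word2vec_format('embedding_word2vec.txt', binary=False)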

Then use the exported word2vec file to build the embedding matrix:

from numpy import asarray, zeros

# load the embedding into a dict of word -> vector
def load_embedding(filename):
    # load embedding into memory, skip the first line (the word2vec text-format header)
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 1; index 0 is reserved for padding/unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    # (100 must match the dimensionality of your trained word2vec vectors)
    weight_matrix = zeros((vocab_size, 100))
    # step over the vocab, storing vectors at the Tokenizer's integer indices
    for word, i in vocab.items():
        vector = embedding.get(word)
        # words missing from the embedding keep their all-zero row
        if vector is not None:
            weight_matrix[i] = vector
    return weight_matrix

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
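
Alternatively, since the model is already in memory, you can skip the intermediate file and fill the weight matrix straight from the gensim vectors. A minimal sketch, again assuming your trained gensim model is in model and that the vectors are 100-dimensional:

import numpy as np

def get_weight_matrix_from_gensim(keyed_vectors, vocab, dim=100):
    # vocab is the Tokenizer's word_index; row 0 stays all zeros for padding/unknown words
    weight_matrix = np.zeros((len(vocab) + 1, dim))
    for word, i in vocab.items():
        if word in keyed_vectors:
            weight_matrix[i] = keyed_vectors[word]
    return weight_matrix

embedding_vectors = get_weight_matrix_from_gensim(model.wv, t.word_index)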

Once you have the embedding matrix, you can use it in an Embedding layer like this (trainable=False keeps the pretrained vectors fixed during training):

e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)

This layer can then be used when building a model like this:

from numpy import array
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

# labels for the ten example docs (assumed here: first five positive, last five negative)
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
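
As a quick sanity check you can evaluate the fitted model on the same toy data (only to verify the wiring; a real setup would use a held-out set):

# evaluate on the training docs - just a wiring check on this toy dataset
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))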

All of the code is adapted from this awesome blog post; follow it to learn more about embeddings using GloVe.

For using word2vec, see this post.

Sreeram TP answered Oct 16 '22