Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to properly use get_keras_embedding() in Gensim’s Word2Vec?

I am trying to build a translation network using embedding and RNN. I have trained a Gensim Word2Vec model and it is learning word associations pretty well. However, I couldn’t get my head around how to properly add the layer to a Keras model. (And how to do an ‘inverse embedding’ for the output. But that’s another question that had been answered: by default you can’t.)

In Word2Vec, when you input a string, e.g. model.wv[‘hello’], you get a vector representation of the word. However, I believe that the keras.layers.Embedding layer returned by Word2Vec's get_keras_embedding() takes a one-hot/tokenized input, instead of a string input. But the documentation provides no explanation on what the appropriate input is. I cannot figure out how to obtain the one-hot/tokenized vector of the vocabulary that has 1-to-1 correspondence with the Embedding layer’s input.

More elaboration below:

Currently my workaround is to apply the embedding outside Keras before feeding it to the network. Is there any detriment in doing this? I will set the embedding to non-trainable anyway. So far I have noticed that memory use is extremely inefficient (like 50GB even before declaring the Keras model for a collection of 64-word-long sentences) having to load the padded inputs and the weights outside the model. Maybe generator can help.

The following is my code. Inputs are padded to 64-words long. The Word2Vec embedding has 300 dimensions. There are probably a lot of mistakes here due to repeated experimentation trying to make embedding work. Suggestions are welcome.

import gensim
word2vec_model = gensim.models.Word2Vec.load(“word2vec.model")
from keras.models import Sequential
from keras.layers import Embedding, GRU, Input, Flatten, Dense, TimeDistributed, Activation, PReLU, RepeatVector, Bidirectional, Dropout
from keras.optimizers import Adam, Adadelta
from keras.callbacks import ModelCheckpoint
from keras.losses import sparse_categorical_crossentropy, mean_squared_error, cosine_proximity

keras_model = Sequential()
keras_model.add(word2vec_model.wv.get_keras_embedding(train_embeddings=False))
keras_model.add(Bidirectional(GRU(300, return_sequences=True, dropout=0.1, recurrent_dropout=0.1, activation='tanh')))
keras_model.add(TimeDistributed(Dense(600, activation='tanh')))
# keras_model.add(PReLU())
# ^ For some reason I get error when I add Activation ‘outside’:
# int() argument must be a string, a bytes-like object or a number, not 'NoneType'
# But keras_model.add(Activation('relu')) works.
keras_model.add(Dense(source_arr.shape[1] * source_arr.shape[2]))
# size = max-output-sentence-length * embedding-dimensions to learn the embedding vector and find the nearest word in word2vec_model.wv.similar_by_vector() afterwards.
# Alternatively one can use Dense(vocab_size) and train the network to output one-hot categorical words instead.
# Remember to change Keras loss to sparse_categorical_crossentropy.
# But this won’t benefit from Word2Vec. 

keras_model.compile(loss=mean_squared_error,
              optimizer=Adadelta(),
              metrics=['mean_absolute_error'])
keras_model.summary()
_________________________________________________________________ 
Layer (type)                 Output Shape              Param #   
================================================================= 
embedding_19 (Embedding)     (None, None, 300)         8219700   
_________________________________________________________________ 
bidirectional_17 (Bidirectio (None, None, 600)         1081800   
_________________________________________________________________ 
activation_4 (Activation)    (None, None, 600)         0         
_________________________________________________________________ 
time_distributed_17 (TimeDis (None, None, 600)         360600    
_________________________________________________________________ 
dense_24 (Dense)             (None, None, 19200)       11539200  
================================================================= 
Total params: 21,201,300 
Trainable params: 12,981,600 
Non-trainable params: 8,219,700
_________________________________________________________________
filepath="best-weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]
keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

Which throws an error when I try to fit the model with text:

Train on 8000 samples, validate on 2000 samples Epoch 1/100

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-865f8b75fbc3> in <module>()
      2 checkpoint = ModelCheckpoint(filepath, monitor='val_mean_absolute_error', verbose=1, save_best_only=True, mode='auto')
      3 callbacks_list = [checkpoint]
----> 4 keras_model.fit(array_of_word_lists, array_of_word_lists_AFTER_being_transformed_by_word2vec, epochs=100, batch_size=2000, shuffle=True, callbacks=callbacks_list, validation_split=0.2)

~/virtualenv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1040                                         initial_epoch=initial_epoch,
   1041                                         steps_per_epoch=steps_per_epoch,
-> 1042                                         validation_steps=validation_steps)
   1043 
   1044     def evaluate(self, x=None, y=None,

~/virtualenv/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, f, ins, out_labels, batch_size, epochs, verbose, callbacks, val_f, val_ins, shuffle, callback_metrics, initial_epoch, steps_per_epoch, validation_steps)
    197                     ins_batch[i] = ins_batch[i].toarray()
    198 
--> 199                 outs = f(ins_batch)
    200                 if not isinstance(outs, list):
    201                     outs = [outs]

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
   2659                 return self._legacy_call(inputs)
   2660 
-> 2661             return self._call(inputs)
   2662         else:
   2663             if py_any(is_tensor(x) for x in inputs):

~/virtualenv/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
   2612                 array_vals.append(
   2613                     np.asarray(value,
-> 2614                                dtype=tensor.dtype.base_dtype.name))
   2615         if self.feed_dict:
   2616             for key in sorted(self.feed_dict.keys()):

~/virtualenv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    490 
    491     """
--> 492     return array(a, dtype, copy=False, order=order)
    493 
    494 

ValueError: could not convert string to float: 'hello'

The following is an excerpt from Rajmak demonstrating how to use a tokenizer to convert words into the input of a Keras Embedding.

tokenizer = Tokenizer(num_words=MAX_NB_WORDS) 
tokenizer.fit_on_texts(all_texts) 
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
……
indices = np.arange(data.shape[0]) # get sequence of row index 
np.random.shuffle(indices) # shuffle the row indexes 
data = data[indices] # shuffle data/product-titles/x-axis
……
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0]) 
x_train = data[:-nb_validation_samples]
……
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

Keras embedding layer can be obtained by Gensim Word2Vec’s word2vec.get_keras_embedding(train_embeddings=False) method or constructed like shown below. The null word embeddings indicate the number of words not found in our pre-trained vectors (In this case Google News). This could possibly be unique words for brands in this context.

from keras.layers import Embedding
word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))

embedding_layer = Embedding(embedding_matrix.shape[0], # or len(word_index) + 1
                            embedding_matrix.shape[1], # or EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(300, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(150, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(75, 3, padding='valid',activation='relu',strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='sigmoid'))

model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])

model.summary()

model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=2, batch_size=128)
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Here the embedding_layer is explicitly created using:

for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)

However, if we use get_keras_embedding(), the embedding matrix is already constructed and fixed. I do not know how each word_index in the Tokenizer can be coerced match the corresponding word in get_keras_embedding()'s Keras embedding input.

So, what is the proper way to use Word2Vec's get_keras_embedding() in Keras?

like image 574
Moobie Avatar asked Jul 24 '18 07:07

Moobie


1 Answers

So I've found the solution. The Tokenized word index can be found in word2vec_model.wv.vocab[word].index and the converse can be obtained by word2vec_model.wv.index2word[word_index]. get_keras_embedding() takes the former as input.

I do the conversion as follows:

source_word_indices = []
for i in range(len(array_of_word_lists)):
    source_word_indices.append([])
    for j in range(len(array_of_word_lists[i])):
        word = array_of_word_lists[i][j]
        if word in word2vec_model.wv.vocab:
            word_index = word2vec_model.wv.vocab[word].index
            source_word_indices[i].append(word_index)
        else:
            # Do something. For example, leave it blank or replace with padding character's index.
            source_word_indices[i].append(padding_index)
source = numpy.array(source_word_indices)

Then finally keras_model.fit(source, ...

like image 125
Moobie Avatar answered Sep 30 '22 13:09

Moobie