
Using pre-trained word2vec with LSTM for word generation

LSTM/RNN models can be used for text generation. This shows a way to use pre-trained GloVe word embeddings with a Keras model.

  1. How to use pre-trained Word2Vec word embeddings with a Keras LSTM model? This post did help.
  2. How to predict / generate the next word when the model is given a sequence of words as its input?

Sample approach tried:

    # Sample code to prepare word2vec word embeddings
    import gensim
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]
    sentences = [[word for word in document.lower().split()] for document in documents]

    word_model = gensim.models.Word2Vec(sentences, size=200, min_count = 1, window = 5)

    # Code tried to prepare LSTM model for word generation
    from keras.layers.recurrent import LSTM
    from keras.layers.embeddings import Embedding
    from keras.models import Model, Sequential
    from keras.layers import Dense, Activation

    embedding_layer = Embedding(input_dim=word_model.syn0.shape[0], output_dim=word_model.syn0.shape[1], weights=[word_model.syn0])

    model = Sequential()
    model.add(embedding_layer)
    model.add(LSTM(word_model.syn0.shape[1]))
    model.add(Dense(word_model.syn0.shape[0]))
    model.add(Activation('softmax'))
    model.compile(optimizer='sgd', loss='mse')

Sample code / pseudocode to train the LSTM and predict would be appreciated.

Vishal Shukla asked Feb 06 '17 09:02



1 Answer

I've created a gist with a simple generator that builds on top of your initial idea: it's an LSTM network wired to the pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is a list of abstracts from the arXiv website.

I'll highlight the most important parts here.

Gensim Word2Vec

Your code is fine, except for the number of iterations used to train it. The default of 5 epochs seems rather low. Besides, word2vec training is definitely not the bottleneck: LSTM training takes much longer. 100 epochs looks better. (The parameter is called epochs in gensim 4 and iter in older releases.)

    import gensim

    # gensim >= 4.0 API; older releases used size=, iter=, wv.syn0, wv.vocab and wv.index2word
    word_model = gensim.models.Word2Vec(sentences, vector_size=100, min_count=1,
                                        window=5, epochs=100)
    pretrained_weights = word_model.wv.vectors
    vocab_size, emdedding_size = pretrained_weights.shape
    print('Result embedding shape:', pretrained_weights.shape)
    print('Checking similar words:')
    for word in ['model', 'network', 'train', 'learn']:
      most_similar = ', '.join('%s (%.2f)' % (similar, dist)
                               for similar, dist in word_model.wv.most_similar(word)[:8])
      print('  %s -> %s' % (word, most_similar))

    # Map between words and their row indices in the embedding matrix
    def word2idx(word):
      return word_model.wv.key_to_index[word]
    def idx2word(idx):
      return word_model.wv.index_to_key[idx]

The resulting embedding matrix is stored in the pretrained_weights array, which has shape (vocab_size, emdedding_size).
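As a toy illustration of how the index helpers and the weight matrix fit together, the sketch below uses a made-up three-word vocabulary and plain Python lists in place of the gensim model. The names index_to_word, word_to_index, weights and embed are hypothetical and not part of gensim. The only point is that word index i selects row i of the matrix, which is what allows the matrix to be loaded into an embedding layer unchanged.

```python
# Hypothetical 3-word vocabulary with 2-dimensional vectors (not real word2vec output)
index_to_word = ["model", "network", "train"]
weights = [[0.1, 0.2],   # row 0 -> "model"
           [0.3, 0.4],   # row 1 -> "network"
           [0.5, 0.6]]   # row 2 -> "train"
word_to_index = {word: i for i, word in enumerate(index_to_word)}

def embed(word):
    # An embedding lookup is just row selection by word index
    return weights[word_to_index[word]]

vocab_size, embedding_size = len(weights), len(weights[0])
print((vocab_size, embedding_size), embed("network"))
```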

Keras model

Your code is almost correct, except for the loss function. Since the model predicts the next word, it's a classification task, hence the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I've chosen the latter for efficiency reasons: this way it avoids one-hot encoding, which is pretty expensive for a big vocabulary.

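    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense, Activation

    model = Sequential()
    # Embedding layer initialized with the word2vec matrix
    model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size,
                        weights=[pretrained_weights]))
    model.add(LSTM(units=emdedding_size))
    # One output unit per vocabulary word, turned into probabilities by softmax
    model.add(Dense(units=vocab_size))
    model.add(Activation('softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')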

Note that the pre-trained embedding matrix is passed to the Embedding layer through its weights argument.
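To illustrate why the sparse variant yields the same loss while skipping the one-hot step, here is a minimal pure-Python sketch. It is not Keras code: the two functions and the toy probability vector are made up for illustration, and they only mirror the per-example formulas.

```python
import math

def categorical_crossentropy(one_hot, probs):
    # Full form: sums over the whole vocabulary, although only one term is non-zero
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label_idx, probs):
    # Sparse form: reads the predicted probability of the true word directly
    return -math.log(probs[label_idx])

probs = [0.1, 0.7, 0.2]   # hypothetical softmax output over a 3-word vocabulary
label_idx = 1             # index of the true next word
one_hot = [1.0 if i == label_idx else 0.0 for i in range(len(probs))]

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(label_idx, probs)
print(dense_loss, sparse_loss)
```

Both values agree. The practical difference is in label storage: one-hot labels need one vector of vocabulary length per sentence, while sparse labels need a single integer per sentence.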

Data preparation

In order to work with the sparse_categorical_crossentropy loss, both sentences and labels must be encoded as word indices. Short sentences must be padded with zeros to a common length.

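    import numpy as np

    # max_sentence_len is the fixed input length, assumed to be defined earlier in the script
    train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
    train_y = np.zeros([len(sentences)], dtype=np.int32)
    for i, sentence in enumerate(sentences):
      # Input: every word but the last; unused positions stay zero (padding)
      for t, word in enumerate(sentence[:-1]):
        train_x[i, t] = word2idx(word)
      # Label: the index of the last word
      train_y[i] = word2idx(sentence[-1])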
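A small worked example may make the encoding easier to follow. The sketch below applies the same loop to two made-up sentences, using plain Python lists and a hypothetical word_to_index dictionary instead of the gensim vocabulary, so the concrete indices are assumptions of this example only.

```python
# Two made-up tokenized sentences and a toy vocabulary built from them
sentences = [["the", "graph", "minors"],
             ["a", "survey", "of", "user", "opinion"]]
vocab = sorted({word for sentence in sentences for word in sentence})
word_to_index = {word: i for i, word in enumerate(vocab)}
max_sentence_len = 6  # hypothetical fixed input length

train_x = [[0] * max_sentence_len for _ in sentences]
train_y = [0] * len(sentences)
for i, sentence in enumerate(sentences):
    # Input: all words except the last; remaining positions stay 0 (padding)
    for t, word in enumerate(sentence[:-1]):
        train_x[i][t] = word_to_index[word]
    # Label: index of the last word
    train_y[i] = word_to_index[sentence[-1]]

print(word_to_index)
print(train_x, train_y)
```

In this toy vocabulary the word "a" receives index 0, the same value that is used for padding. The scheme above has the same property, since zero is both the fill value and a valid word index.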

Sample generation

This is pretty straightforward: the model outputs a vector of probabilities, from which the next word is sampled and appended to the input. Note that the generated text is better and more diverse if the next word is sampled rather than picked as the argmax. The temperature-based random sampling I've used is described here.

    def sample(preds, temperature=1.0):
      # A non-positive temperature means greedy decoding
      if temperature <= 0:
        return np.argmax(preds)
      preds = np.asarray(preds).astype('float64')
      # Rescale log-probabilities by the temperature and renormalize
      preds = np.log(preds) / temperature
      exp_preds = np.exp(preds)
      preds = exp_preds / np.sum(exp_preds)
      probas = np.random.multinomial(1, preds, 1)
      return np.argmax(probas)

    def generate_next(text, num_generated=10):
      word_idxs = [word2idx(word) for word in text.lower().split()]
      for i in range(num_generated):
        # Feed the whole sequence as one batch item of shape (1, len(word_idxs)),
        # so the prediction is conditioned on every word generated so far
        prediction = model.predict(x=np.array([word_idxs]))
        idx = sample(prediction[-1], temperature=0.7)
        word_idxs.append(idx)
      return ' '.join(idx2word(idx) for idx in word_idxs)
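To see what the temperature does before the random draw, the following pure-Python sketch applies the same rescaling as sample() to a made-up three-word distribution. The helper apply_temperature and the probabilities are hypothetical and only isolate that step.

```python
import math

def apply_temperature(probs, temperature):
    # Divide log-probabilities by the temperature, then renormalize
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.3, 0.2]  # hypothetical model output
for temperature in (0.2, 0.7, 1.0, 2.0):
    rescaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

Temperatures below 1 concentrate probability on the most likely word, which moves the result toward argmax. A temperature of 1 leaves the distribution unchanged, and values above 1 flatten it, which makes the output more diverse.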

Examples of generated text

    deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
    simple and effective... -> simple and effective family of variables preventing compute automatically
    a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
    a... -> a function parameterization necessary both both intuitions with technique valpola utilizes

The output doesn't make much sense, but the model is able to produce sentences that look at least grammatically sound (sometimes).

The link to the complete runnable script.

Maxim answered Oct 10 '22 04:10
