I've been trying to understand the sample code that accompanies https://www.tensorflow.org/tutorials/recurrent, which you can find at https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
(Using tensorflow 1.3.0.)
I've summarized (what I think are) the key parts, for my question, below:
size = 200
vocab_size = 10000
layers = 2

# input_.input_data is a 2D tensor [batch_size, num_steps] of
#   word ids, from 1 to 10000

cell = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.BasicLSTMCell(size) for _ in range(2)]
    )

embedding = tf.get_variable(
    "embedding", [vocab_size, size], dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

inputs = tf.unstack(inputs, num=num_steps, axis=1)
outputs, state = tf.contrib.rnn.static_rnn(
    cell, inputs, initial_state=self._initial_state)

output = tf.reshape(tf.stack(axis=1, values=outputs), [-1, size])
softmax_w = tf.get_variable(
    "softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b

# Then calculate loss, do gradient descent, etc.
My biggest question is how do I use the produced model to actually generate a next word suggestion, given the first few words of a sentence? Concretely, I imagine the flow is like this, but I cannot get my head around what the code for the commented lines would be:
prefix = ["What", "is", "your"] state = #Zeroes # Call static_rnn(cell) once for each word in prefix to initialize state # Use final output to set a string, next_word print(next_word)
My sub-questions are:

- Why use a random (uninitialized, untrained) word-embedding?
- Why use softmax?
- Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)?
- How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?

(I'm asking them all as one question, as I suspect they are all connected, and connected to some gap in my understanding.)
What I was expecting to see here was the code loading an existing word2vec set of word embeddings (e.g. using gensim's KeyedVectors.load_word2vec_format()), converting each word in the input corpus to that representation as each sentence is loaded, and then afterwards the LSTM would spit out a vector of the same dimension, and we would try to find the most similar word (e.g. using gensim's similar_by_vector(y, topn=1)).

Is using softmax saving us from the relatively slow similar_by_vector(y, topn=1) call?
BTW, for the pre-existing word2vec part of my question, "Using pre-trained word2vec with LSTM for word generation" is similar. However the answers there, currently, are not what I'm looking for. What I'm hoping for is a plain English explanation that switches the light on for me, and plugs whatever the gap in my understanding is. "Use pre-trained word2vec in lstm language model?" is another similar question.

UPDATE: "Predicting next word using the language model tensorflow example" and "Predicting the next word using the LSTM ptb model tensorflow example" are similar questions. However, neither shows the code to actually take the first few words of a sentence and print out its prediction of the next word. I tried pasting in code from the 2nd question, and from https://stackoverflow.com/a/39282697/841830 (which comes with a github branch), but cannot get either to run without errors. I think they may be for an earlier version of TensorFlow?

ANOTHER UPDATE: Yet another question asking basically the same thing: "Predicting Next Word of LSTM Model from Tensorflow Example". It links to "Predicting next word using the language model tensorflow example" (and, again, the answers there are not quite what I am looking for).
In case it still isn't clear, what I am trying to write is a high-level function called getNextWord(model, sentencePrefix), where model is a previously built LSTM that I've loaded from disk, and sentencePrefix is a string, such as "Open the", and it might return "pod". I then might call it with "Open the pod" and it will return "bay", and so on.
An example (with a character RNN, and using mxnet) is the sample() function shown near the end of https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter05_recurrent-neural-networks/simple-rnn.ipynb. You can call sample() during training, but you can also call it after training, and with any sentence you want.
Predicting the next word is an application of recurrent neural networks. Since basic recurrent neural networks have a lot of shortcomings (in particular, they struggle to remember long-range context), we go for an LSTM. Its three gates (input, forget, and output) let the model keep a longer memory of which words are important.
Unlike a feedforward neural network, an LSTM has feedback connections. It can therefore handle not only single data points but also sequential data such as weather measurements, stock market prices, audio, or video.
To forecast more than one step ahead, you need to define the outputs as y[t : t + H] (instead of y[t] as in the current code), where y is the time series and H is the length of the forecast horizon (i.e. the number of days ahead that you want to forecast).
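As a rough sketch of what that slicing means in practice (plain numpy, with made-up names like window and H; nothing here comes from the PTB tutorial code):

import numpy as np

H = 7        # forecast horizon: how many steps ahead to predict (assumed value)
window = 30  # how many past values the model sees at once (assumed value)

series = np.sin(np.linspace(0, 50, 1000))  # stand-in for a real time series y

X, Y = [], []
for t in range(len(series) - window - H + 1):
    X.append(series[t : t + window])               # inputs: y[t : t + window]
    Y.append(series[t + window : t + window + H])  # targets: the next H values
X, Y = np.array(X), np.array(Y)
print(X.shape, Y.shape)  # (964, 30) (964, 7)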
Load custom data instead of using the test set:
reader.py@ptb_raw_data

test_path = os.path.join(data_path, "ptb.test.txt")
test_data = _file_to_word_ids(test_path, word_to_id)  # change this line
test_data should contain word ids (print out word_to_id for a mapping). As an example, it should look like: [1, 52, 562, 246] ...
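If you want to feed in your own sentence prefix rather than a file, a minimal sketch would be the following (assuming word_to_id is the dict built by reader.py, and that the "<unk>" token is how the PTB vocabulary represents out-of-vocabulary words; words_to_ids is a hypothetical helper, not part of the tutorial):

def words_to_ids(sentence, word_to_id):
    """Convert a space-separated prefix into the list of word ids the model expects."""
    unk = word_to_id.get("<unk>")
    return [word_to_id.get(w, unk) for w in sentence.lower().split()]

test_data = words_to_ids("open the pod", word_to_id)
print(test_data)  # e.g. [234, 5, 9876] -- the actual ids depend on the vocabulary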
We need to return the output of the FC layer (logits) in the call to sess.run:

ptb_word_lm.py@PTBModel.__init__

logits = tf.reshape(logits, [self.batch_size, self.num_steps, vocab_size])
self.top_word_id = tf.argmax(logits, axis=2)  # add this line

ptb_word_lm.py@run_epoch

fetches = {
    "cost": model.cost,
    "final_state": model.final_state,
    "top_word_id": model.top_word_id  # add this line
}
Later in the function, vals['top_word_id'] will have an array of integers with the ID of the top word. Look this up in word_to_id to determine the predicted word. I did this a while ago with the small model, and the top-1 accuracy was pretty low (20-30% IIRC), even though the perplexity was what was predicted in the header.
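To turn those ids back into words, a small sketch of that lookup (vals is the dict returned by sess.run(fetches, feed_dict) inside run_epoch; taking [0, -1] to grab the prediction after the last word of the prefix is an assumption about the batch layout):

id_to_word = {i: w for w, i in word_to_id.items()}  # invert the vocabulary mapping

top_ids = vals["top_word_id"]                 # shape (batch_size, num_steps), one id per position
next_word = id_to_word[int(top_ids[0, -1])]   # prediction following the last prefix word
print("predicted next word:", next_word)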
Why use a random (uninitialized, untrained) word-embedding?
You'd have to ask the authors, but in my opinion, training the embeddings makes this more of a standalone tutorial: instead of treating embedding as a black box, it shows how it works.
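As a plain-numpy illustration of what that black box actually is (random numbers standing in for the trained values; this is not the tutorial's code):

import numpy as np

vocab_size, size = 10000, 200
embedding = np.random.uniform(-0.1, 0.1, (vocab_size, size)).astype(np.float32)

word_ids = [1, 52, 562]        # one row of input_.input_data
vectors = embedding[word_ids]  # tf.nn.embedding_lookup is essentially this row indexing
print(vectors.shape)           # (3, 200)

Training the model updates those rows by backpropagation just like any other weight matrix, which is the point the tutorial makes by not loading pre-trained vectors.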
Why use softmax?
The final prediction is not determined by the cosine similarity to the output of the hidden layer. There is an FC layer after the LSTM that converts the embedded state to a one-hot encoding of the final word.
Here's a sketch of the operations and dimensions in the neural net:
word -> one-hot encoding (1 x vocab_size) -> embedding (1 x hidden_size) -> LSTM -> FC layer (1 x vocab_size) -> softmax (1 x vocab_size)
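In code terms, sticking to the variable names from the summary at the top of the question (this is a sketch of the same computation; probs and next_word_id are illustrative names, not from the tutorial):

# `output` is the stacked LSTM output, shape (batch_size * num_steps, size)
logits = tf.matmul(output, softmax_w) + softmax_b  # (batch_size * num_steps, vocab_size)
probs = tf.nn.softmax(logits)                      # a probability for every word in the vocab
next_word_id = tf.argmax(probs, axis=1)            # most likely word id at each position

Because the FC layer already produces one score per vocabulary word, picking the prediction is a simple argmax over the logits, not a nearest-neighbour search such as similar_by_vector().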
Does the hidden layer have to match the dimension of the input (i.e. the dimension of the word2vec embeddings)?
Technically, no. If you look at the LSTM equations, you'll notice that x (the input) can be any size, as long as the weight matrix is adjusted appropriately.
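For example, writing out just the input gate in standard LSTM notation (not copied from the tutorial; d is the embedding/input size and n the hidden size):

i_t = \sigma(W_{xi}\,x_t + W_{hi}\,h_{t-1} + b_i), \qquad x_t \in \mathbb{R}^{d},\; h_{t-1} \in \mathbb{R}^{n},\; W_{xi} \in \mathbb{R}^{n \times d},\; W_{hi} \in \mathbb{R}^{n \times n},\; b_i \in \mathbb{R}^{n}

Nothing forces d to equal n; the tutorial just happens to use size = 200 for both the embedding and the hidden state, which is why they match here.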
How/Can I bring in a pre-trained word2vec model, instead of that uninitialized one?
I don't know, sorry.