I am trying to understand how an LSTM is used to classify text sentences (word sequences) represented by pre-trained word embeddings. I have been reading some posts about LSTMs and I am confused about the detailed procedure:
IMDB classification using LSTM on Keras: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
Colah's explanation on LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Say, for example, I want to use an LSTM to classify movie reviews, where each review has a fixed length of 500 words, and I am using pre-trained word embeddings (from fastText) that give a 100-dimensional vector for each word. What will be the dimensions of Xt fed into the LSTM? And how is the LSTM trained? If each Xt is a 100-dimensional vector representing one word in a review, do I feed the words of a review to the LSTM one at a time? What does the LSTM do in each epoch? I am really confused...
# LSTM for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
In the above code example (taken from Jason Brownlee's blog https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/), an LSTM of 100 cells/neurons is used. How are the 100 neurons interconnected? Why can't I just use 1 cell from the figure above for classification, since it is recurrent and feeds its output back to itself at the next timestep? Any visualization graphs would be welcome.
Thanks!!
Shapes with the embedding:
- X_train.shape == (reviews, words), which is (reviews, 500)
In the LSTM (after the embedding, or if you didn't have an embedding):
- The input data has shape (reviews, words, embedding_size), i.e. (reviews, 500, 100), where the 100 was automatically created by the embedding.
- Without an embedding layer, the input shape for the model could be either:
  - input_shape = (500, 100)
  - input_shape = (None, 100) - this option supports variable-length reviews
- Each Xt is a slice from input_data[:,timestep,:], which results in shape (reviews, 100).
- Each Ht is discarded; the result is only the last h, because you're not using return_sequences=True (but this is OK for your model).
Your code seems to be doing everything, so you don't have to do anything special to train this model. Use fit with a proper X_train and y_train, and the model's output will have shape (reviews, 1).
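To make the shapes above concrete, here is a minimal NumPy sketch of what the embedding lookup and the Xt slice amount to. It is only an illustration, not Keras code: the random `embedding_matrix` stands in for real fastText vectors, and the tiny batch of 4 reviews and the variable names are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 4 reviews, 500 words each, 5000-word vocabulary, 100-dim vectors
reviews, words, vocab_size, embedding_size = 4, 500, 5000, 100

# Integer-encoded reviews, already padded to 500 words: shape (reviews, words)
X_train = rng.integers(0, vocab_size, size=(reviews, words))

# Stand-in for a pre-trained embedding matrix (random here, for illustration only)
embedding_matrix = rng.normal(size=(vocab_size, embedding_size))

# An embedding lookup picks one row of the matrix per word index
embedded = embedding_matrix[X_train]      # (reviews, words, embedding_size)

# Xt is the slice for a single timestep across all reviews
timestep = 0
x_t = embedded[:, timestep, :]            # (reviews, embedding_size)

print(X_train.shape, embedded.shape, x_t.shape)
```

The three printed shapes correspond to (reviews, 500), (reviews, 500, 100) and (reviews, 100) from the list above.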
Questions:
If each Xt is a 100-dimensional vector representing one word in a review, do I feed each word in a review to the LSTM one at a time?
No, the LSTM layer already does everything by itself, including all the recurrent steps, provided its input has shape (reviews, words, embedding_size).
How are the 100 neurons interconnected?
They are sort of parallel (you can imagine 100 images like the one you posted, all in parallel), almost the same as in other kinds of usual layers.
But during the recurrent steps, there is a mathematical expression that makes them communicate (unfortunately I can't explain exactly how).
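For reference, the expression in question can be written out using the standard LSTM formulation. The notation here is an assumption on my part, not something taken from the question: x_t is the input vector at step t, h_t and c_t are the hidden and cell states with n = 100 entries each, W are the input weights, U the recurrent weights, and b the biases.

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

Each U is an n-by-n matrix, so every entry of a gate depends on all n entries of h_{t-1}. That product is where the 100 units "talk" to each other, while the elementwise products in the last two lines keep each unit's own cell state separate.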
Why can't I just use 1 cell from the figure above for classification, since it is recurrent and feeds its output back to itself at the next timestep?
You can if you want, but the more cells, the more capacity the layer has (as happens with every other kind of layer).
There is nothing special about the chosen number 100. It's probably a coincidence or a misunderstanding. It can be any number: 50 cells, 200 cells, 1000 cells...
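One way to see what the number of cells buys you is to count trainable weights. The helper below is a hypothetical function of my own, assuming the standard LSTM with bias terms; the input size of 32 matches the Embedding output in the question's code.

```python
def lstm_param_count(input_dim: int, units: int) -> int:
    """Trainable weights in a standard LSTM layer with bias terms."""
    # Four weight sets (input, forget and output gates plus the candidate state),
    # each with an input kernel, a recurrent kernel and a bias vector.
    per_gate = input_dim * units + units * units + units
    return 4 * per_gate

for units in (1, 50, 100, 200):
    print(units, lstm_param_count(input_dim=32, units=units))
```

For units=100 this gives 53200, which should correspond to the LSTM row printed by model.summary() for the model in the question. With a single cell there are only 136 weights to fit the whole task.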
Understanding LSTMs deeply:
You are confusing some terms, so let's try to clarify what is going on step by step:
- The Embedding layer looks up words[index] for every word in every sample, giving a tensor of shape (samples, 500, 100) if your embedding size is 100.
- LSTM(100) means a layer that runs a single LSTM cell (one like in Colah's diagram), with an output size of 100, over every word. Let me try that again: you create a single LSTM cell that transforms the input into a size-100 output (the hidden size), and the layer runs the same cell over the words.
- By default the layer returns only the final output, of shape (samples, 100). If you passed return_sequences=True, then every hidden output (h-1, h, h+1 in the diagram) would be returned, so we would obtain a shape of (samples, 500, 100).
- Finally, the Dense layer makes the prediction, which gives (samples, 1), so one prediction for every review in the batch.
The takeaway is that the LSTM layer wraps around an LSTMCell and runs it over every timestep for you, so you don't have to write the loop operations yourself.
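The loop that the layer hides can be sketched in plain NumPy. This is a rough illustration under assumptions, not the Keras implementation: `lstm_cell` and `lstm_layer` are hypothetical helpers, the weights are random rather than trained, and the order of the gates inside the weight matrices is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell for a whole batch."""
    units = h_prev.shape[1]
    z = x_t @ W + h_prev @ U + b              # (samples, 4 * units)
    i = sigmoid(z[:, 0 * units:1 * units])    # input gate
    f = sigmoid(z[:, 1 * units:2 * units])    # forget gate
    g = np.tanh(z[:, 2 * units:3 * units])    # candidate cell state
    o = sigmoid(z[:, 3 * units:4 * units])    # output gate
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def lstm_layer(inputs, W, U, b, return_sequences=False):
    """Run the same cell, with the same weights, over every timestep."""
    samples, timesteps, _ = inputs.shape
    units = U.shape[0]
    h = np.zeros((samples, units))
    c = np.zeros((samples, units))
    outputs = []
    for t in range(timesteps):
        h, c = lstm_cell(inputs[:, t, :], h, c, W, U, b)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)      # (samples, timesteps, units)
    return h                                  # (samples, units)

rng = np.random.default_rng(0)
samples, timesteps, embedding_size, units = 3, 500, 100, 100

# Stand-in for the Embedding output; random values, for shape checking only
embedded = rng.normal(size=(samples, timesteps, embedding_size))
W = rng.normal(scale=0.1, size=(embedding_size, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

last_h = lstm_layer(embedded, W, U, b)
all_h = lstm_layer(embedded, W, U, b, return_sequences=True)

# Equivalent of Dense(1, activation='sigmoid') on the last hidden state
dense_w = rng.normal(scale=0.1, size=(units, 1))
dense_b = np.zeros(1)
prediction = sigmoid(last_h @ dense_w + dense_b)

print(last_h.shape, all_h.shape, prediction.shape)
```

The printed shapes are (samples, 100), (samples, 500, 100) and (samples, 1), matching the steps above. Note that there is one set of weights (W, U, b) reused at all 500 timesteps, which is the sense in which the layer runs "a single cell" over the review.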