How do LSTMs work with word embeddings for text classification? An example in Keras

I am trying to understand how an LSTM is used to classify text sentences (word sequences) consisting of pre-trained word embeddings. I have been reading through some posts about LSTMs and I am confused about the detailed procedure:

  • IMDB classification using an LSTM in Keras: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
  • Colah's explanation of LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Say, for example, I want to use an LSTM to classify movie reviews, where each review has a fixed length of 500 words, and I am using pre-trained word embeddings (from fasttext) that give a 100-dimensional vector for each word. What will be the dimensions of Xt to feed into the LSTM, and how is the LSTM trained? If each Xt is a 100-dimensional vector representing one word in a review, do I feed the words of a review into the LSTM one at a time? What does the LSTM do in each epoch? I am really confused...

[Image: LSTM cell diagram from Colah's blog]

# LSTM for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In the above code example (taken from Jason Brownlee's blog https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/), an LSTM layer of 100 cells/neurons is used. How are the 100 neurons interconnected? Why can't I just use a single cell like the one in the figure above for classification, since it is recurrent and feeds its output back to itself at the next timestep? Any visualization graphs would be welcome.

Thanks!!

Asked May 18 '18 by yuhengd


2 Answers

Shapes with the embedding:

  • Shape of the input data: X_train.shape == (reviews, words), which is (reviews, 500)

In the LSTM (after the embedding, or if you didn't have an embedding):

  • Shape of the input data: (reviews, words, embedding_size):
    • (reviews, 500, 100) - where 100 was automatically created by the embedding
  • Input shape for the model (if you didn't have an embedding layer) could be either:
    • input_shape = (500, 100)
    • input_shape = (None, 100) - This option supports variable length reviews
  • Each Xt is a slice from input_data[:,timestep,:], which results in shape:
    • (reviews, 100)
    • But this is entirely automatic, made by the layer itself.
  • Each intermediate Ht is discarded; the result is only the last h, because you're not using return_sequences=True (which is fine for your model).

Your code already does everything needed, so you don't have to do anything special to train this model. Just call fit with a proper X_train and a y_train of shape (reviews, 1).
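For illustration, here is a minimal sketch of the no-embedding case, where the 100-dimensional vectors (e.g. from fasttext, as in your question) are assumed to be computed outside the model and fed in directly; the review count and the data below are just dummy stand-ins:

from keras.models import Sequential
from keras.layers import LSTM, Dense
import numpy as np

# dummy stand-ins: 32 reviews, 500 words each, 100-dimensional pre-computed vectors
X_train = np.random.rand(32, 500, 100)
y_train = np.random.randint(0, 2, size=(32, 1))

model = Sequential()
model.add(LSTM(100, input_shape=(500, 100)))  # or input_shape=(None, 100) for variable-length reviews
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # LSTM output: (None, 100), Dense output: (None, 1)

model.fit(X_train, y_train, epochs=1, batch_size=8)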

Questions:

If each Xt is a 100-dimensional vector representing one word in a review, do I feed the words of a review into the LSTM one at a time?

No, the LSTM layer already does everything by itself, including all the recurrent steps, provided its input has shape (reviews, words, embedding_size).
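As a quick sketch (using tf.keras, with random numbers standing in for real embedded reviews), a single call on the full 3D tensor is all that is needed:

import numpy as np
import tensorflow as tf

X = np.random.rand(8, 500, 100).astype('float32')  # (reviews, words, embedding_size)
out = tf.keras.layers.LSTM(100)(X)  # the layer loops over the 500 timesteps internally
print(out.shape)  # (8, 100): one final hidden state per review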


How are the 100 neurons interconnected?

They work sort of in parallel (you can imagine 100 copies of the cell in the figure you posted, side by side), much like the units in other kinds of layers.

But during the recurrent steps there is a mathematical interaction between them: the recurrent weight matrices multiply the full 100-dimensional hidden state at every timestep, so each cell sees the previous outputs of all the others.
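To make that concrete, here is a small tf.keras check using the sizes from the code above (embedding size 32, 100 units); the dummy input is only there to build the weights:

import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(100)
lstm(np.zeros((1, 500, 32), dtype='float32'))  # run once on dummy input to build the weights

kernel, recurrent_kernel, bias = lstm.get_weights()
print(kernel.shape)            # (32, 400): input -> 4 gates x 100 units
print(recurrent_kernel.shape)  # (100, 400): previous hidden state -> 4 gates x 100 units
print(bias.shape)              # (400,)
# 32*400 + 100*400 + 400 = 53,200 parameters, matching model.summary() for LSTM(100)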


Why can't I just use a single cell like the one in the figure above for classification, since it is recurrent and feeds its output back to itself at the next timestep?

You can if you want, but the more cells, the smarter the layer (just as with every other kind of layer).

There is nothing special about the number 100 chosen. It's probably a coincidence or a misunderstanding. It can be any number, 50 cells, 200 cells, 1000 cells...
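For example (a tf.keras sketch on dummy data), a single-cell layer is perfectly valid; the number of cells only sets the size of the summary vector each review is reduced to:

import numpy as np
import tensorflow as tf

X = np.random.rand(4, 500, 100).astype('float32')  # 4 dummy embedded reviews
print(tf.keras.layers.LSTM(1)(X).shape)    # (4, 1)   one cell: a very coarse summary
print(tf.keras.layers.LSTM(100)(X).shape)  # (4, 100)
print(tf.keras.layers.LSTM(200)(X).shape)  # (4, 200) any number of units works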


Understanding LSTMs deeply:

  • All types of usages, one to many, many to one, many to many: https://stackoverflow.com/a/50235563/2097240
Answered Oct 17 '22 by Daniel Möller


You are confusing some terms; let's clarify what is going on step by step:

  1. The data in your case will be of shape (samples, 500), which means we have some number of reviews, each at most 500 words long and encoded as integers.
  2. Then the Embedding layer looks up embedding[index] for every word index in every sample, giving a tensor of shape (samples, 500, 100) if your embedding size is 100.
  3. Now here is the confusing bit: when we say LSTM(100), it means a layer that runs a single LSTM cell (like the one in Colah's diagram) over every word, and that cell has an output size of 100. To put it another way: you create a single LSTM cell that transforms the input into an output of size 100 (the hidden size), and the layer runs that same cell over the words.
  4. Now we obtain (samples, 100), because the same LSTM processes every review of 500 words and returns only the final output, which is of size 100. If, for example, we passed return_sequences=True, then every hidden output (h-1, h, h+1 in the diagram) would be returned, and we would obtain a shape of (samples, 500, 100).
  5. Finally, we pass the (samples, 100) tensor to a Dense layer to make the prediction, which gives (samples, 1), i.e. a prediction for every review in the batch.

The take-away lesson is that the LSTM layer wraps an LSTMCell and runs it over every timestep for you, so you don't have to write the loop yourself.
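Here is a rough sketch of that wrapping (using tf.keras; the two outputs differ numerically because the layer and the standalone cell have separately initialized weights, so this only shows the shapes and the loop):

import tensorflow as tf

batch, timesteps, embed_dim, units = 2, 500, 100, 100
x = tf.random.normal((batch, timesteps, embed_dim))  # dummy embedded reviews

# what you write: the layer runs the whole recurrence for you
out = tf.keras.layers.LSTM(units)(x)
print(out.shape)  # (2, 100)

# roughly what it does internally: one cell applied to one word (timestep) at a time
cell = tf.keras.layers.LSTMCell(units)
h = tf.zeros((batch, units))
c = tf.zeros((batch, units))
for t in range(timesteps):
    output, [h, c] = cell(x[:, t, :], [h, c])
print(output.shape)  # (2, 100): the final hidden state, same shape as the layer's output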

Answered Oct 17 '22 by nuric