 

How to build a language model using an LSTM that assigns the probability of occurrence for a given sentence

Currently, I am using a trigram model to do this. It assigns the probability of occurrence for a given sentence, but it is limited to a context of only two words. LSTMs can take much longer contexts into account, so how do I build an LSTM model that assigns the probability of occurrence for a given sentence?
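To make the current setup concrete, this is roughly the kind of trigram model I mean (a minimal sketch using maximum-likelihood counts on a made-up toy corpus, not my actual code):

from collections import Counter

# Toy corpus with <s>/</s> boundary markers (illustrative data)
corpus = [["<s>", "<s>", "two", "little", "dicky", "birds", "</s>"],
          ["<s>", "<s>", "sat", "on", "a", "wall", "</s>"]]

trigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    for i in range(2, len(sent)):
        trigram_counts[tuple(sent[i - 2:i + 1])] += 1
        bigram_counts[tuple(sent[i - 2:i])] += 1

def trigram_sentence_prob(words):
    # P(sentence) as a product of P(w_i | w_{i-2}, w_{i-1}) estimated from counts
    sent = ["<s>", "<s>"] + words + ["</s>"]
    p = 1.0
    for i in range(2, len(sent)):
        den = bigram_counts[tuple(sent[i - 2:i])]
        p *= trigram_counts[tuple(sent[i - 2:i + 1])] / den if den else 0.0
    return p

print(trigram_sentence_prob(["two", "little", "dicky", "birds"]))  # 1.0 on this toy corpus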

asked Jul 01 '18 by Swamy


1 Answer

I have just coded a very simple example showing how one might compute the probability of occurrence of a sentence with an LSTM model. The full code can be found here.

Suppose we want to predict the probability of occurrence of a sentence for the following dataset (this rhyme was published in Mother Goose's Melody in London around 1765):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Data
data = ["Two little dicky birds",
        "Sat on a wall,",
        "One called Peter,",
        "One called Paul.",
        "Fly away, Peter,",
        "Fly away, Paul!",
        "Come back, Peter,",
        "Come back, Paul."]

First of all, let's use keras.preprocessing.text.Tokenizer to create a vocabulary and tokenize the sentences:

# Preprocess data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
vocab = tokenizer.word_index
seqs = tokenizer.texts_to_sequences(data)
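
If you want to inspect what the tokenizer produced, vocab maps each lowercased word to an integer index (starting at 1, most frequent words first) and seqs holds each verse as a list of those indices. The exact indices depend on the word frequencies, so treat these as illustrative:

print(vocab)     # e.g. {'peter': 1, 'paul': 2, 'one': 3, ...}
print(seqs[0])   # "Two little dicky birds" as a list of four word indices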

Our model will take a sequence of words as input (context), and will output the conditional probability distribution of each word in the vocabulary given the context. To this end, we prepare the training data by padding the sequences and sliding windows over them:

def prepare_sentence(seq, maxlen):
    # Pads seq and slides windows
    x = []
    y = []
    for i, w in enumerate(seq):
        x_padded = pad_sequences([seq[:i]],
                                 maxlen=maxlen - 1,
                                 padding='pre')[0]  # Pads before each sequence
        x.append(x_padded)
        y.append(w)
    return x, y

# Pad sequences and slide windows
maxlen = max([len(seq) for seq in seqs])
x = []
y = []
for seq in seqs:
    x_windows, y_windows = prepare_sentence(seq, maxlen)
    x += x_windows
    y += y_windows
x = np.array(x)
y = np.array(y) - 1  # The word <PAD> does not constitute a class
y = np.eye(len(vocab))[y]  # One hot encoding
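
To make the windowing concrete, this is roughly what prepare_sentence yields for a short made-up sequence, together with the shapes that x and y end up with (illustrative only; the actual indices depend on the tokenizer):

# Illustrative: for seq = [5, 6, 7] and maxlen = 4, prepare_sentence returns
#   contexts x = [[0, 0, 0], [0, 0, 5], [0, 5, 6]]  (left-padded to maxlen - 1)
#   targets  y = [5, 6, 7]                          (the word following each context)
x_demo, y_demo = prepare_sentence([5, 6, 7], 4)

print(x.shape)  # (number of windows over all verses, maxlen - 1)
print(y.shape)  # (number of windows over all verses, len(vocab)), one-hot targets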

I decided to slide windows separately for each verse, but this could be done differently.

Next, we define and train a simple LSTM model with Keras. The model consists of an embedding layer, an LSTM layer, and a dense layer with a softmax activation (which uses the output at the last timestep of the LSTM to produce the probability of each word in the vocabulary given the context):

# Define model
model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size. Adding an
                                               # extra element for <PAD> word
                    output_dim=5,  # size of embeddings
                    input_length=maxlen - 1))  # length of the padded sequences
model.add(LSTM(10))
model.add(Dense(len(vocab), activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy')

# Train network
model.fit(x, y, epochs=1000)
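
Training for 1000 epochs prints a lot of console output; if you prefer to silence it and simply check that the loss goes down, you can use Keras's verbose flag and the History object returned by fit (shown here as an alternative to the call above):

history = model.fit(x, y, epochs=1000, verbose=0)
print('Final training loss: {}'.format(history.history['loss'][-1]))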

The joint probability P(w_1, ..., w_n) of occurrence of a sentence w_1 ... w_n can be computed using the chain rule of probability:

P(w_1, ..., w_n)=P(w_1)*P(w_2|w_1)*...*P(w_n|w_{n-1}, ..., w_1)

where each of these conditional probabilities is given by the LSTM model. Note that they might be very small, so it is sensible to work in log space in order to avoid numerical instability issues. Putting it all together:

# Compute probability of occurrence of a sentence
sentence = "One called Peter,"
tok = tokenizer.texts_to_sequences([sentence])[0]
x_test, y_test = prepare_sentence(tok, maxlen)
x_test = np.array(x_test)
y_test = np.array(y_test) - 1  # The word <PAD> does not constitute a class
p_pred = model.predict(x_test)  # array of conditional probabilities
vocab_inv = {v: k for k, v in vocab.items()}

# Compute product
# Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
log_p_sentence = 0
for i, prob in enumerate(p_pred):
    word = vocab_inv[y_test[i]+1]  # Index 0 from vocab is reserved for <PAD>
    history = ' '.join([vocab_inv[w] for w in x_test[i, :] if w != 0])
    prob_word = prob[y_test[i]]
    log_p_sentence += np.log(prob_word)
    print('P(w={}|h={})={}'.format(word, history, prob_word))
print('Prob. sentence: {}'.format(np.exp(log_p_sentence)))

NOTE: This is a very small toy dataset and we might be overfitting.
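
As a quick sanity check (not part of the original answer), you could wrap the scoring steps above in a small helper and compare a verse from the corpus against a reordered version of it; on this overfitted toy model the in-corpus verse should typically come out far more probable:

def sentence_logprob(sentence):
    # Log-probability of a sentence under the trained model (illustrative helper)
    tok = tokenizer.texts_to_sequences([sentence])[0]
    x_test, y_test = prepare_sentence(tok, maxlen)
    x_test = np.array(x_test)
    y_test = np.array(y_test) - 1
    p_pred = model.predict(x_test)
    return np.sum(np.log(p_pred[np.arange(len(y_test)), y_test]))

print(sentence_logprob("One called Peter,"))  # a verse the model was trained on
print(sentence_logprob("Peter called one,"))  # same words, different order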

answered Sep 24 '22 by rvinas