 

How to build a language model using an LSTM that assigns the probability of occurrence for a given sentence

Currently, I am using a trigram model to do this. It assigns the probability of occurrence for a given sentence, but it is limited to a context of only two words. LSTMs can take much longer contexts into account, so how do I build an LSTM model that assigns the probability of occurrence for a given sentence?
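To make the current setup concrete, this is roughly the kind of trigram model I mean (a minimal sketch using maximum-likelihood counts on a made-up toy corpus, not my actual code):

from collections import Counter

# Toy corpus with <s>/</s> boundary markers (illustrative data)
corpus = [["<s>", "<s>", "two", "little", "dicky", "birds", "</s>"],
          ["<s>", "<s>", "sat", "on", "a", "wall", "</s>"]]

trigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    for i in range(2, len(sent)):
        trigram_counts[tuple(sent[i - 2:i + 1])] += 1
        bigram_counts[tuple(sent[i - 2:i])] += 1

def trigram_sentence_prob(words):
    # P(sentence) as a product of P(w_i | w_{i-2}, w_{i-1}) estimated from counts
    sent = ["<s>", "<s>"] + words + ["</s>"]
    p = 1.0
    for i in range(2, len(sent)):
        den = bigram_counts[tuple(sent[i - 2:i])]
        p *= trigram_counts[tuple(sent[i - 2:i + 1])] / den if den else 0.0
    return p

print(trigram_sentence_prob(["two", "little", "dicky", "birds"]))  # 1.0 on this toy corpus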

asked Jul 01 '18 by Swamy


1 Answer

I have just coded a very simple example showing how one might compute the probability of occurrence of a sentence with an LSTM model. The full code can be found here.

Suppose we want to predict the probability of occurrence of a sentence for the following dataset (this rhyme was published in Mother Goose's Melody in London around 1765):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Data
data = ["Two little dicky birds",
        "Sat on a wall,",
        "One called Peter,",
        "One called Paul.",
        "Fly away, Peter,",
        "Fly away, Paul!",
        "Come back, Peter,",
        "Come back, Paul."]

First of all, let's use keras.preprocessing.text.Tokenizer to create a vocabulary and tokenize the sentences:

# Preprocess data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
vocab = tokenizer.word_index
seqs = tokenizer.texts_to_sequences(data)
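
If you want to inspect what the tokenizer produced, vocab maps each lowercased word to an integer index (starting at 1, most frequent words first) and seqs holds each verse as a list of those indices. The exact indices depend on the word frequencies, so treat these as illustrative:

print(vocab)     # e.g. {'peter': 1, 'paul': 2, 'one': 3, ...}
print(seqs[0])   # "Two little dicky birds" as a list of four word indices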

Our model will take a sequence of words as input (context), and will output the conditional probability distribution of each word in the vocabulary given the context. To this end, we prepare the training data by padding the sequences and sliding windows over them:

def prepare_sentence(seq, maxlen):
    # Pads seq and slides windows
    x = []
    y = []
    for i, w in enumerate(seq):
        x_padded = pad_sequences([seq[:i]],
                                 maxlen=maxlen - 1,
                                 padding='pre')[0]  # Pads before each sequence
        x.append(x_padded)
        y.append(w)
    return x, y

# Pad sequences and slide windows
maxlen = max([len(seq) for seq in seqs])
x = []
y = []
for seq in seqs:
    x_windows, y_windows = prepare_sentence(seq, maxlen)
    x += x_windows
    y += y_windows
x = np.array(x)
y = np.array(y) - 1  # The word <PAD> does not constitute a class
y = np.eye(len(vocab))[y]  # One hot encoding
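
To make the windowing concrete, this is roughly what prepare_sentence yields for a short made-up sequence, together with the shapes that x and y end up with (illustrative only; the actual indices depend on the tokenizer):

# Illustrative: for seq = [5, 6, 7] and maxlen = 4, prepare_sentence returns
#   contexts x = [[0, 0, 0], [0, 0, 5], [0, 5, 6]]  (left-padded to maxlen - 1)
#   targets  y = [5, 6, 7]                          (the word following each context)
x_demo, y_demo = prepare_sentence([5, 6, 7], 4)

print(x.shape)  # (number of windows over all verses, maxlen - 1)
print(y.shape)  # (number of windows over all verses, len(vocab)), one-hot targets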

I decided to slide windows separately for each verse, but this could be done differently.

Next, we define and train a simple LSTM model with Keras. The model consists of an embedding layer, an LSTM layer, and a dense layer with a softmax activation (which uses the output at the last timestep of the LSTM to produce the probability of each word in the vocabulary given the context):

# Define model
model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size. Adding an
                                               # extra element for <PAD> word
                    output_dim=5,  # size of embeddings
                    input_length=maxlen - 1))  # length of the padded sequences
model.add(LSTM(10))
model.add(Dense(len(vocab), activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy')

# Train network
model.fit(x, y, epochs=1000)
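
Training for 1000 epochs prints a lot of console output; if you prefer to silence it and simply check that the loss goes down, you can use Keras's verbose flag and the History object returned by fit (shown here as an alternative to the call above):

history = model.fit(x, y, epochs=1000, verbose=0)
print('Final training loss: {}'.format(history.history['loss'][-1]))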

The joint probability P(w_1, ..., w_n) of occurrence of a sentence w_1 ... w_n can be computed using the chain rule of probability:

P(w_1, ..., w_n)=P(w_1)*P(w_2|w_1)*...*P(w_n|w_{n-1}, ..., w_1)

where each of these conditional probabilities is given by the LSTM model. Note that they might be very small, so it is sensible to work in log space in order to avoid numerical instability issues. Putting it all together:

# Compute probability of occurrence of a sentence
sentence = "One called Peter,"
tok = tokenizer.texts_to_sequences([sentence])[0]
x_test, y_test = prepare_sentence(tok, maxlen)
x_test = np.array(x_test)
y_test = np.array(y_test) - 1  # The word <PAD> does not constitute a class
p_pred = model.predict(x_test)  # array of conditional probabilities
vocab_inv = {v: k for k, v in vocab.items()}

# Compute product
# Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
log_p_sentence = 0
for i, prob in enumerate(p_pred):
    word = vocab_inv[y_test[i]+1]  # Index 0 from vocab is reserved for <PAD>
    history = ' '.join([vocab_inv[w] for w in x_test[i, :] if w != 0])
    prob_word = prob[y_test[i]]
    log_p_sentence += np.log(prob_word)
    print('P(w={}|h={})={}'.format(word, history, prob_word))
print('Prob. sentence: {}'.format(np.exp(log_p_sentence)))

NOTE: This is a very small toy dataset and we might be overfitting.
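
As a quick sanity check (not part of the original answer), you could wrap the scoring steps above in a small helper and compare a verse from the corpus against a reordered version of it; on this overfitted toy model the in-corpus verse should typically come out far more probable:

def sentence_logprob(sentence):
    # Log-probability of a sentence under the trained model (illustrative helper)
    tok = tokenizer.texts_to_sequences([sentence])[0]
    x_test, y_test = prepare_sentence(tok, maxlen)
    x_test = np.array(x_test)
    y_test = np.array(y_test) - 1
    p_pred = model.predict(x_test)
    return np.sum(np.log(p_pred[np.arange(len(y_test)), y_test]))

print(sentence_logprob("One called Peter,"))  # a verse the model was trained on
print(sentence_logprob("Peter called one,"))  # same words, different order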

answered Sep 24 '22 by rvinas