How to fill in the blank using bidirectional RNN and pytorch?

Question

I am trying to fill in the blank using a bidirectional RNN and pytorch.

The input will be like: The dog is _____, but we are happy he is okay.

The output will be like:

1. hyper (Perplexity score here) 
2. sad (Perplexity score here) 
3. scared (Perplexity score here)

I discovered this idea here: https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27

import torch, torch.nn as nn
from torch.autograd import Variable

text = ['BOS', 'How', 'are', 'you', 'EOS']
seq_len = len(text)
batch_size = 1
embedding_size = 1
hidden_size = 1
output_size = 1

random_input = Variable(
    torch.FloatTensor(seq_len, batch_size, embedding_size).normal_(), requires_grad=False)

bi_rnn = torch.nn.RNN(
    input_size=embedding_size, hidden_size=hidden_size, num_layers=1, batch_first=False, bidirectional=True)

bi_output, bi_hidden = bi_rnn(random_input)

# stagger
forward_output, backward_output = bi_output[:-2, :, :hidden_size], bi_output[2:, :, hidden_size:]
staggered_output = torch.cat((forward_output, backward_output), dim=-1)

linear = nn.Linear(hidden_size * 2, output_size)

# only predict on words
labels = random_input[1:-1]

# for language models, use cross-entropy :)
loss = nn.MSELoss()
output = loss(linear(staggered_output), labels)

I am trying to reimplement the code above found at the bottom of the blog post. I am new to pytorch and nlp, and can't understand what the input and output to the code is.

Question about the input: I am guessing the input are the few words that are given. Why does one need beginning of sentence and end of sentence tags in this case? Why don't I see the input being a corpus on which the model is trained like other classic NLP problems? I would like to use the Enron email corpus to train the RNN.

Question about the output: I see the output is a tensor. My understanding is the tensor is a vector, so maybe a word vector in this case. How can you use the tensor to output the words themselves?

Szymon Maszke · Accepted Answer

As this question is rather open-ended I will start from the last parts, moving towards the more general answer to the main question posed in the title.

Quick note: as pointed out in the comments by @Qusai Alothman, you should find a better resource on the topic; this one is rather sparse when it comes to the necessary information.

Additional note: full code for the process described in the last section would take far too much space for an answer; it would be more of a blog post. I will highlight the steps one could take to create such a network, with helpful links as we go along.

Final note: if there is anything dumb down below, or you would like to expand the answer in any way, please correct me or add info by posting a comment.

Question about the input

The input here is drawn from a random normal distribution and has no connection to actual words. It is supposed to stand in for word embeddings, i.e. representations of words as numbers that carry semantic (this is important!) meaning, sometimes depending on the context as well (see one of the current state-of-the-art approaches, e.g. BERT).

Shape of the input

In your example it is provided as:

seq_len, batch_size, embedding_size,

where

seq_len - the length of a single sentence (it varies across your dataset; we will get to that later).
batch_size - how many sentences are processed in one step of the forward pass (in PyTorch, that is the forward method of a class inheriting from torch.nn.Module).
embedding_size - the size of the vector that represents one word (it might range from the usual 100/300 with word2vec up to 768 or 1024 with more recent approaches like the BERT mentioned above).

In this case every size is hard-coded to one, which is not really useful for a newcomer; it only outlines the idea.
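To make the shapes concrete, here is a minimal sketch with sizes I picked purely for illustration (they are not from the blog post). It pushes a random tensor of shape (seq_len, batch_size, embedding_size) through a bidirectional GRU and records the shapes that come out.

```python
import torch

# Illustrative sizes only; none of these come from the original post.
seq_len, batch_size, embedding_size, hidden_size = 7, 4, 300, 100

# Stand-in for real word embeddings: one vector per word per sentence.
inputs = torch.randn(seq_len, batch_size, embedding_size)

rnn = torch.nn.GRU(
    input_size=embedding_size,
    hidden_size=hidden_size,
    bidirectional=True,  # batch_first defaults to False
)
output, hidden = rnn(inputs)

# Forward and backward states are concatenated along the last dimension.
output_shape = tuple(output.shape)  # (seq_len, batch_size, 2 * hidden_size)
# One final hidden state per direction (single layer here).
hidden_shape = tuple(hidden.shape)  # (num_directions, batch_size, hidden_size)
print(output_shape, hidden_shape)
```

The first half of the last dimension of output belongs to the forward direction and the second half to the backward direction, which is what the slicing by hidden_size in the question's code relies on.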

Why does one need beginning of sentence and end of sentence tags in this case?

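Correct me if I'm wrong, but in general you don't need them if your input is already separated into sentences. They are used when you feed multiple sentences to the model and want to mark unambiguously where each one begins and ends (this matters for models that depend on the previous/next sentences, which does not seem to be the case here). They are encoded as special tokens (ones that do not appear anywhere else in the corpus), so the neural network "could learn" that they represent the beginning and end of a sentence (a single special token would be enough for this approach). One caveat: the staggered slicing in the snippet from the question (bi_output[:-2] and bi_output[2:]) predicts each word from the forward state of its left neighbour and the backward state of its right neighbour, so that particular code does need a token on each side of the sentence.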

If you were to use a serious dataset, I would advise splitting your text into sentences using a library like spaCy or nltk (the first one is a pleasure to use IMO); they do a really good job at this task.

Your dataset might already be split into sentences, in which case you are more or less ready to go.

Why don't I see the input being a corpus on which the model is trained like other classic NLP problems?

I don't recall models being trained on corpora as is, i.e. on raw strings. Usually the text is represented by floating-point numbers using:

Simple approaches, e.g. Bag Of Words or TF-IDF
More sophisticated ones, which provide some information about word relationships (e.g. king is more semantically related to queen than to, say, banana). Those were already linked above; other notable ones are GloVe and ELMo, and there are tons of other creative approaches.

Question about the output

The network should output indices into the embedding matrix, which in turn correspond to words represented by vectors (the more sophisticated approach mentioned above).

Each row in such an embedding matrix represents a unique word, and its columns hold that word's representation. In PyTorch, the first index might be reserved for words whose representation is unknown (if you are using pretrained embeddings); you may also delete those words or represent them as an average of the sentence/document, and there are some other viable approaches as well.

Loss provided in the example

# for language models, use cross-entropy :)
loss = nn.MSELoss()

For this task it makes no sense, as Mean Squared Error is a regression loss, not a classification one.

We want a classification loss, so softmax should be used for the multiclass case (we should be predicting a class index in the range [0, N), where N is the number of unique words in our corpus).

PyTorch's CrossEntropyLoss already takes logits (the output of the last layer without an activation like softmax) and returns the loss, averaged over the batch by default. I would advise this approach as it's numerically stable (and I like it as the most minimal one).
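As a small sketch of what that looks like in practice (the vocabulary size, batch size and target indices below are made up), the loss is applied directly to raw logits and integer word indices. The second computation only shows that it matches log-softmax followed by negative log-likelihood.

```python
import torch

vocab_size = 10  # hypothetical number of unique words
batch_size = 3

# Raw scores from the last linear layer: no softmax applied.
logits = torch.randn(batch_size, vocab_size)
# Index of the correct word for each example, each in [0, vocab_size).
targets = torch.tensor([1, 0, 7])

criterion = torch.nn.CrossEntropyLoss()  # reduction="mean" by default
loss = criterion(logits, targets)

# The same value computed by hand: log-softmax, then negative log-likelihood.
manual = torch.nn.functional.nll_loss(torch.log_softmax(logits, dim=1), targets)
print(loss.item(), manual.item())
```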

I am trying to fill in the blank using a bidirectional RNN and pytorch

This is a long one, so I will only highlight the steps I would take to create a model that follows the idea outlined in the post.

Basic preparation of dataset

You may use the one you mentioned above or start with something easier like 20 newsgroups from scikit-learn.

First steps should be roughly this:

strip the metadata (if any) from your dataset (those might be HTML tags, some headers etc.)
split your text into sentences using a pre-made library (mentioned above)

Next, you would like to create your targets (i.e. the words to be filled in) for each sentence. Each word in turn should be replaced by a special token (say <target-token>) and moved to the target.

Example:

sentence: Neural networks can do some stuff.

would give us the following sentences and their respective targets:

sentence: <target-token> networks can do some stuff. target: Neural
sentence: Neural <target-token> can do some stuff. target: networks
sentence: Neural networks <target-token> do some stuff. target: can
sentence: Neural networks can <target-token> some stuff. target: do
sentence: Neural networks can do <target-token> stuff. target: some
sentence: Neural networks can do some <target-token>. target: stuff
sentence: Neural networks can do some stuff <target-token> target: .
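The list above can be generated mechanically. Here is a minimal sketch in plain Python; make_examples is a hypothetical helper name of mine, and it assumes the sentence has already been tokenized.

```python
from typing import List, Tuple

TARGET_TOKEN = "<target-token>"  # same placeholder as in the example above


def make_examples(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Return one (masked sentence, target word) pair per token."""
    examples = []
    for i, word in enumerate(tokens):
        # Copy the sentence with position i replaced by the placeholder.
        masked = tokens[:i] + [TARGET_TOKEN] + tokens[i + 1:]
        examples.append((masked, word))
    return examples


tokens = ["Neural", "networks", "can", "do", "some", "stuff", "."]
examples = make_examples(tokens)
for masked, target in examples:
    print(" ".join(masked), "->", target)
```

In a real pipeline you would probably skip punctuation as a target, but that is a design choice to experiment with.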

You should adjust this approach to the problem at hand: correct typos if there are any, tokenize, lemmatize and so on. Experiment!

Embeddings

Each word in each sentence should be replaced by an integer, which in turn points to its embedding.
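A minimal sketch of that word-to-integer step, in plain Python. The special token names and the helpers build_vocab and encode are hypothetical; libraries that ship pretrained vectors usually come with their own vocabulary you would use instead.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical names for the special tokens


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct word the next free integer, in order of appearance."""
    vocab = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map words to indices; unseen words fall back to the UNK index."""
    return [vocab.get(word, vocab[UNK]) for word in sentence]


sentences = [["the", "dog", "is", "happy"], ["the", "cat", "is", "sad"]]
vocab = build_vocab(sentences)
encoded = encode(["the", "bird", "is", "happy"], vocab)
print(encoded)  # "bird" was never seen, so it maps to the UNK index
```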

I would advise you to use a pre-trained one. spaCy provides word vectors, but another interesting approach I would highly recommend is in the open source library flair.

You may train your own, but it would take a lot of time + a lot of data for unsupervised training, and I think it is way beyond the scope of this question.

Data batching

One should use PyTorch's torch.utils.data.Dataset and torch.utils.data.DataLoader.

In my case, a good idea was to provide a custom collate_fn to DataLoader, which is responsible for creating padded batches of data (or batches already represented as torch.nn.utils.rnn.PackedSequence).

Important: currently, you have to sort the batch by length (word-wise) and keep the indices needed to "unsort" the batch into its original order, so remember that during implementation. You may use torch.sort for that task. In future versions of PyTorch there is a chance one might not have to do that, see this issue.

Oh, and remember to shuffle your dataset using DataLoader, while we're at it.
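Here is a rough sketch of such a collate_fn on a toy dataset of made-up indices. Note one assumption that goes beyond the text above: instead of sorting by hand, it passes enforce_sorted=False to pack_padded_sequence, which needs PyTorch 1.1 or later. On older versions you would sort with torch.sort as described.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence
from torch.utils.data import DataLoader


def collate_fn(batch):
    """batch: list of (LongTensor of word indices, target index) pairs."""
    sequences, targets = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    # Shape (max_len, batch_size); shorter sequences are padded with index 0.
    padded = pad_sequence(sequences, batch_first=False, padding_value=0)
    return padded, lengths, torch.tensor(targets)


# Toy dataset: a plain list works because it has __len__ and __getitem__.
data = [
    (torch.tensor([2, 3, 4]), 5),
    (torch.tensor([2, 6]), 7),
    (torch.tensor([8, 9, 10, 11]), 3),
]
loader = DataLoader(data, batch_size=3, shuffle=True, collate_fn=collate_fn)
padded, lengths, targets = next(iter(loader))

# Look up embeddings, then pack so the RNN skips the padded positions.
embedded = torch.nn.Embedding(12, 8, padding_idx=0)(padded)
packed = pack_padded_sequence(embedded, lengths, enforce_sorted=False)
print(padded.shape, lengths.tolist(), packed.data.shape)
```

Whether to pack inside collate_fn or inside the model is a matter of taste; I kept the embedding lookup outside the collate function so the DataLoader only deals with integer indices.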

Model

You should create a proper model by inheriting from torch.nn.Module. I would advise you to create a more general model, where you can provide PyTorch's cells (like GRU, LSTM or RNN), multilayered and bidirectional (as is described in the post).

Something along those lines when it comes to model construction:

import torch


class Filler(torch.nn.Module):
    def __init__(self, cell, embedding_words_count: int):
        # Required so that PyTorch can register the submodules below
        super().__init__()
        self.cell = cell
        # We want to output a vector of N scores, one per word in the vocabulary
        self.linear = torch.nn.Linear(self.cell.hidden_size, embedding_words_count)

    def forward(self, batch):
        # Assuming batch is a padded tensor of shape (seq_len, batch_size, embedding_size)
        output, _ = self.cell(batch)
        # Batch shape[0] is the length of longest already padded sequence
        # Batch shape[1] is the length of batch, e.g. 32
        # Here we create a view, which allows us to combine bidirectional outputs in a general manner
        output = output.view(
            batch.shape[0],
            batch.shape[1],
            2 if self.cell.bidirectional else 1,
            self.cell.hidden_size,
        )

        # Here outputs of bidirectional RNNs are summed, you may concatenate them instead
        # It makes for an easier implementation, and is another often used approach
        summed_bidirectional_output = output.sum(dim=2)
        # Move the batch dimension first (torch.nn.Linear itself only acts on the last dimension).
        # You may also try with batch_first=True in self.cell and prepare your batch that way
        # In such case no need to permute dimensions
        linear_input = summed_bidirectional_output.permute(1, 0, 2)
        # Logits of shape (batch_size, seq_len, embedding_words_count)
        return self.linear(linear_input)

As you can see, information about shapes can be obtained in a general fashion. Such an approach allows you to create a model with as many layers as you want, bidirectional or not (the batch_first argument is problematic, but you can get around it in a general way too; I left that out for clarity), see below:

model = Filler(
    torch.nn.GRU(
        # Size of your embeddings, e.g. 768 for BERT-base or 300 for spaCy's word vectors
        input_size=300,
        hidden_size=100,
        num_layers=3,
        batch_first=False,
        dropout=0.4,
        bidirectional=True,
    ),
    # How many unique words are there in your dataset
    embedding_words_count=10000,
)

You may pass torch.nn.Embedding into your model (if pretrained and already filled), create it from a numpy matrix, or use a plethora of other approaches; it depends heavily on how exactly you structure your code. Still, please make your code more general and do not hardcode shapes unless it's totally necessary (usually it's not).
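For the "create it from a numpy matrix" route, a minimal sketch could look like this. The random matrix is only a stand-in for real pretrained vectors, and the sizes are made up.

```python
import numpy as np
import torch

# Stand-in for a pretrained matrix: one row per word, one column per dimension.
vocab_size, embedding_size = 5, 3
pretrained = np.random.rand(vocab_size, embedding_size).astype(np.float32)

embedding = torch.nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained),
    freeze=True,  # keep the pretrained vectors fixed during training
)

# (seq_len, batch_size) word indices -> (seq_len, batch_size, embedding_size)
indices = torch.tensor([[0, 2], [4, 1], [3, 3]])
vectors = embedding(indices)
print(vectors.shape)
```

The resulting tensor has exactly the layout the recurrent cell expects when batch_first=False.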

Remember it's only a showcase; you will have to tune and fix it on your own. This implementation returns logits, and no softmax layer is used. If you wish to calculate perplexity, you may have to add one in order to obtain a correct probability distribution across all possible words.
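To get from logits to a ranked list of candidate words like the one in the question, something along these lines should work. The tiny vocabulary and the logit values are invented for illustration; in practice the logits would be the model's output at the blank position.

```python
import math

import torch

index_to_word = ["<unk>", "hyper", "sad", "scared", "happy"]  # hypothetical vocabulary

# Pretend these are the logits the model returned for the blank position.
logits = torch.tensor([0.1, 2.0, 1.5, 1.0, -1.0])

# Softmax turns logits into a probability distribution over the vocabulary.
probabilities = torch.softmax(logits, dim=-1)
top_probs, top_indices = torch.topk(probabilities, k=3)

ranked = [
    (index_to_word[i], p)
    for i, p in zip(top_indices.tolist(), top_probs.tolist())
]
for rank, (word, prob) in enumerate(ranked, start=1):
    # -log(p) is the negative log-likelihood of the word; averaging it over
    # a sequence and exponentiating gives that sequence's perplexity.
    print(f"{rank}. {word} (p={prob:.3f}, nll={-math.log(prob):.3f})")
```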

BTW: Here is some info on concatenation of bidirectional output of RNN.

Model training

I would highly recommend PyTorch ignite as it's quite customizable, you can log a lot of info using it, perform validation and abstract cluttering parts like for loops in training.

Oh, and split your model, training and others into separate modules, don't put everything into one unreadable file.

Final notes

This is the outline of how I would approach this problem. You may have more fun using attention networks instead of merely using the last output layer as in this example, though you shouldn't start with that.

And please check PyTorch's 1.0 documentation, and do not blindly follow tutorials or blog posts you see online, as they can go out of date really fast and the quality of the code varies enormously. For example, torch.autograd.Variable is deprecated, as can be seen in the link.
