
How does Beam Search operate on the output of The Transformer?

According to my understanding (please correct me if I'm wrong), Beam Search is BFS where, at each level, it only explores the b most likely options, where b is the beam size.

To calculate/score each option, especially in NLP (the field I work in), we basically score a candidate sequence by accumulating the probability of each token given everything that comes before it.

This makes sense in a recurrent architecture, where you simply run the decoder on each of the best b first tokens to get the probabilities of the second tokens for each of them. Eventually, you get sequences with probabilities and you just pick the one with the highest probability.
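For concreteness, one step of that recurrent beam search could look like the minimal sketch below. Here `decoder` is a hypothetical step function (last token and hidden state in, vocabulary logits and new hidden state out), and each hypothesis is scored by the sum of its token log-probabilities:

```python
import torch.nn.functional as F

def beam_step(decoder, beams, b):
    """One beam search step for a recurrent (RNN) decoder.

    `decoder` is a hypothetical step function: (last_token, hidden) ->
    (vocab_logits, new_hidden). Each beam is (token_ids, hidden, log_prob).
    """
    candidates = []
    for tokens, hidden, score in beams:
        logits, new_hidden = decoder(tokens[-1], hidden)
        log_probs = F.log_softmax(logits, dim=-1)
        top_lp, top_ids = log_probs.topk(b)        # expand only the b best continuations
        for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
            candidates.append((tokens + [tok], new_hidden, score + lp))
    # keep only the b highest-scoring hypotheses overall
    return sorted(candidates, key=lambda c: c[2], reverse=True)[:b]
```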

However, in the Transformer architecture, where the model doesn't have that recurrence, the output is an entire probability distribution over the vocabulary for each position in the sequence (batch size, max sequence length, vocab size). How do I interpret this output for Beam Search? I can get the encodings for the input sequence, but since there isn't that recurrence of using the previous output as the input for the next token's decoding, how do I go about calculating the probability of all the possible sequences that stem from the best b tokens?

asked Jun 19 '19 by Tony


People also ask

Does transformer use beam search?

Hugging Face Transformers has a new feature called constrained beam search. It allows us to guide a text generation process that previously left the model completely on its own.
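As a rough usage sketch (the model choice and prompt below are assumptions for illustration; `force_words_ids` and `num_beams` are documented parameters of the library's `generate` method):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Force the word "Sie" to appear somewhere in the generated output.
force_words_ids = tokenizer(["Sie"], add_special_tokens=False).input_ids

inputs = tokenizer("translate English to German: How old are you?",
                   return_tensors="pt")
outputs = model.generate(**inputs,
                         num_beams=5,               # constraints require beam search
                         force_words_ids=force_words_ids,
                         max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```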

How does beam search algorithm work?

Beam search is a heuristic search technique that always expands the W best nodes at each level. It progresses level by level and moves downwards only from the best W nodes at each level. Beam search uses breadth-first search to build its search tree.
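A toy, self-contained illustration of that level-by-level expansion (the constant next-token distribution here is purely hypothetical):

```python
import math

def next_probs(prefix):
    # Toy conditional distribution P(next | prefix) over a 3-token vocabulary;
    # constant here purely for illustration.
    return {"a": 0.5, "b": 0.3, "</s>": 0.2}

def beam_search(W, max_len):
    beams = [([], 0.0)]                        # (tokens, summed log-probability)
    for _ in range(max_len):
        level = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "</s>":
                level.append((tokens, score))  # finished hypotheses carry over
                continue
            for tok, p in next_probs(tokens).items():
                level.append((tokens + [tok], score + math.log(p)))
        # move downwards only from the best W nodes at this level
        beams = sorted(level, key=lambda x: x[1], reverse=True)[:W]
    return beams

print(beam_search(W=2, max_len=3))
```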

How is beam search applied to machine translation?

The beam search strategy generates the translation word by word from left to right while keeping a fixed number (the beam) of active candidates at each time step. Increasing the beam size can improve translation performance at the expense of significantly reduced decoding speed.

What is beam search and why should we use beam search in the decoding step of the sequence to sequence model?

Beam search is the most popular search strategy for sequence-to-sequence deep NLP models such as neural machine translation, image captioning, and chatbots. Based on the beam width, it considers multiple best options using conditional probability, which is better than the sub-optimal greedy search.


1 Answer

Beam search works exactly the same way as with recurrent models. The decoder is not recurrent (it's self-attentive), but it is still auto-regressive, i.e., generating a token is conditioned on previously generated tokens.
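Greedy decoding already shows the mechanics: at every step the decoder is re-run on the whole prefix generated so far, and only the distribution at the last position is read. In this sketch, `model` is a hypothetical decoder that maps a token-id prefix (plus the encoder output) to per-position vocabulary logits:

```python
import torch

def greedy_decode(model, src_encodings, bos_id, eos_id, max_len=50):
    """Auto-regressive decoding with a non-recurrent (Transformer) decoder:
    each new token is still conditioned on all previously generated tokens."""
    prefix = [bos_id]
    for _ in range(max_len):
        tokens = torch.tensor([prefix])            # (1, current_length)
        logits = model(tokens, src_encodings)      # (1, current_length, vocab_size)
        next_id = logits[0, -1].argmax().item()    # only the last position matters
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```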

At training time, the self-attention is masked so that it only attends to words to the left of the word currently being generated. This simulates the setup you have at inference time, when you indeed only have the left context (because the right context has not been generated yet).
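The mask itself is just an upper-triangular matrix of blocked positions; a minimal PyTorch sketch:

```python
import torch

T = 5
# True above the diagonal marks positions that must NOT be attended to.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

scores = torch.randn(T, T)                        # raw attention scores
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = scores.softmax(dim=-1)                  # row t attends only to positions <= t
```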

The only difference is that in the RNN decoder, you only carry the last RNN state into each beam search step. With the Transformer, you always need to keep the entire hypothesis and run self-attention over the entire left context.
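Putting this together, a beam search step for a Transformer decoder carries the full token prefix in every hypothesis and re-runs the decoder over it (`model` is the same hypothetical prefix-to-logits decoder as above; real implementations cache keys/values rather than recomputing them):

```python
import torch
import torch.nn.functional as F

def transformer_beam_step(model, src_encodings, beams, b):
    """One beam search step for a Transformer decoder. Each beam is a
    (full_token_prefix, log_prob) pair; there is no compact recurrent
    state, so self-attention runs over the entire left context."""
    candidates = []
    for prefix, score in beams:
        tokens = torch.tensor([prefix])                   # (1, prefix_length)
        logits = model(tokens, src_encodings)             # (1, prefix_length, vocab_size)
        log_probs = F.log_softmax(logits[0, -1], dim=-1)  # last position only
        top_lp, top_ids = log_probs.topk(b)
        for lp, tok in zip(top_lp.tolist(), top_ids.tolist()):
            candidates.append((prefix + [tok], score + lp))
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:b]
```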

answered Oct 17 '22 by Jindřich