Given a state vector we can recursively decode a sequence in a greedy manner by generating each output successively, where each prediction is conditioned on the previous output. I read a paper recently that described using beam search during decoding with a beam size of 1 (k = 1). If we are only retaining the best output at each step, isn't this just the same thing as greedy decoding, offering none of the benefits typically provided by beam search?
Greedy search decoder: a simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence. This approach has the benefit that it is very fast, but the quality of the final output sequences may be far from optimal.
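To make the comparison concrete, here is a minimal greedy-decoding sketch. It assumes the model's next-token probabilities are handed to us as a precomputed matrix (`greedy_decode` and `step_probs` are made-up names for illustration; a real decoder would recompute each row conditioned on the tokens chosen so far):

```python
import numpy as np

def greedy_decode(step_probs):
    """Pick the single most likely token at every step.

    step_probs: array of shape (seq_len, vocab_size); each row holds
    the (assumed precomputed) next-token probabilities for that step.
    Returns the list of chosen token indices.
    """
    return [int(np.argmax(row)) for row in step_probs]

# Toy example: 3 decoding steps over a vocabulary of 4 tokens.
probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.3, 0.2, 0.4, 0.1],
                  [0.5, 0.1, 0.1, 0.3]])
print(greedy_decode(probs))  # -> [1, 2, 0]
```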
Beam Search makes two improvements over Greedy Search. First, where Greedy Search takes just the single best word at each position, Beam Search expands this and keeps the best N candidates. Second, where Greedy Search considers each position in isolation, Beam Search scores whole partial sequences by their cumulative probability. Even so, because only a predetermined number of best partial solutions are kept as candidates at each step, Beam Search is itself a greedy (heuristic) algorithm: it does not guarantee finding the most probable sequence overall.
Beam size, or beam width, is a parameter of the beam search algorithm that determines how many of the best partial solutions to keep and evaluate at each step. In an LSTM model of melody generation, for example, the beam size limits the number of candidate sequences taken as input for the decoder at the next step.
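Here is a matching beam-search sketch under the same assumptions as the greedy example above (precomputed per-step probabilities; `beam_search_decode` is a hypothetical name). The only structural difference from the greedy sketch is that it carries k partial sequences forward instead of one, scoring them by cumulative log-probability:

```python
import numpy as np

def beam_search_decode(step_probs, k):
    """Keep the k highest-scoring partial sequences at each step.

    Scores are summed log-probabilities. Each beam entry is a pair
    (token_list, cumulative_log_prob).
    """
    beams = [([], 0.0)]
    for row in step_probs:
        candidates = []
        for seq, score in beams:
            # Extend every surviving partial sequence with every token.
            for tok, p in enumerate(row):
                candidates.append((seq + [tok], score + np.log(p)))
        # Retain only the k best partial solutions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.3, 0.2, 0.4, 0.1],
                  [0.5, 0.1, 0.1, 0.3]])
print(beam_search_decode(probs, k=3))  # top-3 sequences with scores
print(beam_search_decode(probs, k=1))  # -> [([1, 2, 0], ...)], same as greedy
```

With k = 1 the sort-and-truncate step degenerates to a per-step argmax, which is exactly the operation greedy decoding performs, so the two produce identical output.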
Finally found an answer: a beam size of 1 is the same as greedy search.
From "Abstractive Sentence Summarization with Attentive Recurrent Neural Networks":
"k refers to the size of the beam for generation; k = 1 implies greedy generation."