 

Is there a maximum sequence length for the output of a transformer?

There's just one thing that I can't find an answer to: when feeding the output back into the transformer, we compute it similarly to the inputs (with added masks), so is there also a sequence size limit there?

Even BERT has an input size limit of 512 tokens, so transformers are limited in how much they can take in. Is there something that lets the output be as long as you want, or is there a fixed maximum length?

If I wasn't clear enough: does the network keep generating words indefinitely until it emits the <end> token, or is there a token limit on the output?

asked Oct 21 '25 by RyuGood0

2 Answers

It depends on the type of position encoding the Transformer uses. Models with learned, static position embeddings (such as BERT) cannot go beyond the number of learned positions, simply because there is no embedding for positions past that range, so the model cannot embed the extra input for the decoder to produce an output.
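
As a quick illustration (just a sketch assuming the HuggingFace transformers library and the standard bert-base-uncased checkpoint, neither of which the paragraph above mentions explicitly), you can inspect the size of BERT's learned position-embedding table; any position beyond it simply has no embedding to look up:

    from transformers import AutoConfig

    # Only the configuration is downloaded here, not the model weights.
    config = AutoConfig.from_pretrained("bert-base-uncased")

    # Number of learned position embeddings, i.e. the hard input-length cap.
    print(config.max_position_embeddings)  # 512 for bert-base-uncased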

The original Transformer for machine translation uses an analytically defined position encoding (the so-called sinusoidal encoding), which in theory should generalize to arbitrarily long inputs and outputs. In practice, however, it generalizes badly to sequences much longer than those in the training data.
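
For comparison, here is a minimal NumPy sketch of the sinusoidal encoding from the original Transformer paper. Because it is a pure function of the position index rather than a lookup into a learned table, it can be evaluated for any position, which is why it generalizes in principle (even though quality still degrades on much longer sequences):

    import numpy as np

    def sinusoidal_encoding(position, d_model):
        """Return the sinusoidal position encoding for one position index."""
        # Even dimensions use sin, odd dimensions use cos; the wavelengths
        # form a geometric progression controlled by the 10000 constant.
        i = np.arange(d_model // 2)
        angles = position / np.power(10000.0, 2 * i / d_model)
        enc = np.empty(d_model)
        enc[0::2] = np.sin(angles)
        enc[1::2] = np.cos(angles)
        return enc

    # Works for positions far beyond anything a model was trained on.
    print(sinusoidal_encoding(100_000, d_model=512)[:4])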

If you want to read more about position encodings in Transformers, you can check out this survey.

answered Oct 24 '25 by Jindřich

There are some alternatives, but you can't go beyond 512 tokens with the HuggingFace implementation of BERT. Note that this is not an intrinsic limit of the BERT architecture; it's just the limit you hit when using the very popular HuggingFace pretrained models. You could technically code up your own BERT that takes in longer sequences, and people have done that.
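
To make the limit concrete, here is a small sketch (again assuming the HuggingFace transformers library and the bert-base-uncased checkpoint): the tokenizer will happily produce more than 512 tokens, but you have to truncate before feeding a pretrained BERT, because there are no position embeddings past 512:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    long_text = "word " * 1000  # tokenizes to well over 512 tokens

    # The untruncated encoding exceeds the model's position-embedding table.
    print(len(tokenizer(long_text)["input_ids"]))

    # Truncating keeps the sequence within what the pretrained model accepts.
    encoded = tokenizer(long_text, truncation=True, max_length=512)
    print(len(encoded["input_ids"]))  # 512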

For example, the BART model goes up to 1024 tokens. There are also models that can take up to 16k tokens, but they're more custom and not always available out of the box on HuggingFace. One of these is the Longformer; its model can be accessed via HuggingFace as shown here. You may also want to take a look at this recent paper from Google. It describes a model specifically for text generation (not exactly classification as you asked, but it gives you an idea of what's possible), and the authors have made their code available (you can see more details here and here).
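
As a rough sketch of what using one of these longer-context models looks like (assuming the allenai/longformer-base-4096 checkpoint on the HuggingFace hub, which accepts sequences up to 4096 tokens), loading it is no different from loading BERT:

    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = AutoModel.from_pretrained("allenai/longformer-base-4096")

    # Position-embedding table sized for roughly 4096 tokens
    # (the exact config value includes a small offset for special tokens).
    print(model.config.max_position_embeddings)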

answered Oct 24 '25 by andrea


