
How to truncate a Bert tokenizer in Transformers library

I am using the SciBERT pretrained model to get embeddings for various texts. The code is as follows:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512, truncation=True)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

I have added both the max length and truncation parameters to the tokenizer, but unfortunately they don't truncate the results. If I run a longer text through the tokenizer:

inputs = tokenizer("""long text""")

I get the following error:

Token indices sequence length is longer than the specified maximum sequence length for this model (605 > 512). Running this sequence through the model will result in indexing errors

Now obviously I can't run this through the model because the sequence is too long. What is the easiest way to truncate the input to fit the maximum sequence length of 512?

Asked Dec 30 '22 by Tomaž Bratanič


1 Answer

truncation is not a parameter of the tokenizer's class constructor, but a parameter of its __call__ method. Therefore you should use:

# pass truncation=True when calling the tokenizer, not to from_pretrained
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)

len(tokenizer(text, truncation=True).input_ids)

Output:

512
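
To go from the truncated encoding to an actual embedding, the same call can return tensors that feed straight into the model. Below is a minimal sketch assuming PyTorch and mean pooling over the last hidden state; the pooling choice and the variable names are illustrative, not the only way to do it.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# truncation happens at call time, capping the input at model_max_length (512 tokens)
inputs = tokenizer("""long text""", truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch_size, seq_len, hidden_size); average over tokens for one vector per text
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, 768]) for SciBERT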
Answered Jan 08 '23 by cronoik