
How to truncate a Bert tokenizer in Transformers library

I am using the SciBERT pretrained model to get embeddings for various texts. The code is as follows:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512, truncation=True)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

I have added both the max length and truncation parameters to the tokenizer, but unfortunately they don't truncate the results. If I run a longer text through the tokenizer:

inputs = tokenizer("""long text""")

I get the following error:

Token indices sequence length is longer than the specified maximum sequence length for this model (605 > 512). Running this sequence through the model will result in indexing errors

Now obviously I can't run this through the model because the sequence is too long. What is the easiest way to truncate the input to fit the maximum sequence length of 512?

Asked Dec 30 '22 by Tomaž Bratanič


1 Answer

truncation is not a parameter of the tokenizer's class constructor, but a parameter of its __call__ method. Therefore you should use:

# pass truncation=True when calling the tokenizer, not to from_pretrained
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)

len(tokenizer(text, truncation=True).input_ids)

Output:

512
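
To go from the truncated encoding to an actual embedding, the same call can return tensors that feed straight into the model. Below is a minimal sketch assuming PyTorch and mean pooling over the last hidden state; the pooling choice and the variable names are illustrative, not the only way to do it.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# truncation happens at call time, capping the input at model_max_length (512 tokens)
inputs = tokenizer("""long text""", truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch_size, seq_len, hidden_size); average over tokens for one vector per text
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, 768]) for SciBERT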
Answered Jan 08 '23 by cronoik