I am using the pretrained SciBERT model to get embeddings for various texts. The code is as follows:
from transformers import *
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512, truncation=True)
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
I have passed both the model_max_length and truncation parameters to the tokenizer, but unfortunately they don't truncate the results. If I run a longer text through the tokenizer:
inputs = tokenizer("""long text""")
I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (605 > 512). Running this sequence through the model will result in indexing errors
Obviously I can't run this through the model, because the token sequence is too long. What is the easiest way to truncate the input to fit the maximum sequence length of 512?
truncation is not a parameter of the class constructor (class reference), but a parameter of the __call__ method. Therefore, you should use:
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased', model_max_length=512)
len(tokenizer(text, truncation=True).input_ids)
Output:
512
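To see why the result is exactly 512 rather than 512 tokens of text plus extras, it can help to model what truncation does to a single sequence. The following is a minimal conceptual sketch in plain Python, not the library's implementation: the function name truncate_single_sequence and the ids CLS_ID and SEP_ID are hypothetical placeholders, and it assumes a BERT-style layout with one [CLS] token at the start and one [SEP] token at the end, both of which count toward the limit.

```python
from typing import List

# Hypothetical special-token ids, used only for illustration; the real ids
# come from the tokenizer's vocabulary.
CLS_ID = 1
SEP_ID = 2


def truncate_single_sequence(token_ids: List[int], max_length: int = 512) -> List[int]:
    """Conceptual sketch of truncating one sequence for a BERT-style model."""
    num_special = 2  # [CLS] at the start, [SEP] at the end
    if max_length < num_special:
        raise ValueError("max_length must leave room for the special tokens")
    # Keep only as many word-piece ids as fit once the special tokens are added.
    kept = token_ids[: max_length - num_special]
    return [CLS_ID] + kept + [SEP_ID]


# 603 made-up word-piece ids: with two special tokens that is 605 in total,
# matching the warning above if that count includes the special tokens.
word_piece_ids = list(range(1000, 1603))
input_ids = truncate_single_sequence(word_piece_ids, max_length=512)
print(len(input_ids))
```

Under these assumptions, only the first 510 word-piece ids survive, so any text beyond that point does not contribute to the embedding. If the tail of the document matters, splitting the text into several chunks before tokenizing may be a better fit than truncation.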