
How to do Tokenizer Batch processing? - HuggingFace

In the Tokenizer documentation from HuggingFace, the __call__ function accepts List[List[str]] and says:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

things run normally if I run:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

but if I try to emulate batches of sentences:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 test = [test, test]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

I get:

Traceback (most recent call last):
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
    tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Is the documentation wrong? I just need a way to tokenize and predict using batches; it shouldn't be that hard.

Is it something to do with the is_split_into_words argument?


Contextualizing

I will feed the tokenized output into a sentiment scoring model (the one defined in the code snippets above). I am facing OOM problems during prediction, so I need to feed the data to the model in batches.

The documentation (referred to above) states that I can feed List[List[str]] to the tokenizer, which does not seem to be the case. The question remains the same: how to tokenize batches of sentences?

Note: I don't strictly need the tokenization itself to happen in batches (although that would yield batches of input_ids/attention_masks), which would solve my problem: using the model for prediction in batches like this:

with torch.no_grad():
    logits = model(**tokenized_test).logits

Lucas Azevedo asked Apr 20 '26 12:04

1 Answer

How to tokenize a list of sentences?

If it's just tokenizing a list of sentences, do this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 
tokenizer(test)

It does the batching automatically:

{'input_ids': [
 [101, 7592, 2023, 2003, 1037, 3231, 102], [101, 2008, 21743, 1037, 2862, 1997, 11746, 102], 
 [101, 2046, 1037, 2862, 1997, 2862, 1997, 11746, 102], 
 [101, 1999, 2344, 2000, 7861, 9869, 1010, 1999, 2023, 2553, 1010, 2048, 14108, 2229, 1997, 1996, 2168, 18798, 13900, 102], 
 [101, 2000, 2022, 19204, 3550, 2011, 1996, 1044, 2546, 19204, 17629, 2005, 1996, 4225, 2944, 102]], 

'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

How to use it with the AutoModelForSequenceClassification?

And to use it with AutoModelForSequenceClassification, load the matching model and unpack the tokenizer output into it:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

[out]:

SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5094, -1.2056],
        [-3.4114,  3.5229],
        [ 1.8835, -1.6886],
        [ 3.0780, -2.5745],
        [ 2.5383, -2.1984]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
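To turn those logits into sentiment labels, apply a softmax per row and take the argmax. A minimal sketch, reusing the logits printed above and assuming this model's label order (index 0 = NEGATIVE, index 1 = POSITIVE, which is what model.config.id2label reports for this checkpoint):

```python
import torch

# Logits copied from the SequenceClassifierOutput above;
# each row is [NEGATIVE, POSITIVE] for this model.
logits = torch.tensor([[ 1.5094, -1.2056],
                       [-3.4114,  3.5229],
                       [ 1.8835, -1.6886],
                       [ 3.0780, -2.5745],
                       [ 2.5383, -2.1984]])

probs = torch.softmax(logits, dim=-1)      # per-sentence probabilities
preds = probs.argmax(dim=-1)               # 0 = NEGATIVE, 1 = POSITIVE

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
labels = [id2label[int(i)] for i in preds]
print(labels)
# ['NEGATIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE', 'NEGATIVE']
```

The softmax of the first row gives a NEGATIVE probability of about 0.938, matching the pipeline scores shown below.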

How to use the distilbert-base-uncased-finetuned-sst-2-english model for sentiment classification?

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']
 
classifier(text)

[out]:

[{'label': 'NEGATIVE', 'score': 0.9379092454910278},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]

What to do when I have OOM issues on the GPU?

If it's distilbert-base-uncased-finetuned-sst-2-english, a relatively small model, you can simply use the CPU, where you are unlikely to face OOM issues.

If you need to use a GPU, consider running inference through pipeline(...), which comes with a batch_size option, e.g.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")

When you face OOM issues, it is usually not the tokenizer causing the problem, unless you loaded the full dataset onto the device.

If it is just the model failing when you feed in the entire dataset at once, consider using pipeline instead of calling model(**tokenizer(text)) directly.

Take a look at https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
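If you'd rather keep the explicit model(**enc).logits workflow from the question, a hand-rolled batching loop works just as well. A minimal sketch (predict_in_batches is a hypothetical helper name, assuming any tokenizer/model pair used as in the snippets above):

```python
import torch

def predict_in_batches(model, tokenizer, texts, batch_size=2):
    """Tokenize and predict fixed-size chunks so that only one small
    batch of tensors lives on the device at a time."""
    all_logits = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]  # each chunk is a plain List[str]
            enc = tokenizer(batch, padding=True, truncation=True,
                            return_tensors="pt")
            all_logits.append(model(**enc).logits)
    return torch.cat(all_logits, dim=0)
```

Note that each chunk is still a flat List[str], so the tokenizer's documented batch path is used and no List[List[str]] nesting is needed.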


If the question is regarding the is_split_into_words argument, then from the docs:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

And from the source code:

if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))

And if we apply that check to see whether your input counts as is_batched:

text = ["hello", "this", "is a test"]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

False

But when you wrap the tokens in another list,

text = [["hello", "this", "is a test"]]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

True

Therefore, using the tokenizer with is_split_into_words=True on genuinely pre-tokenized inputs to get batch processing working properly would look something like this:

from transformers import AutoTokenizer
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
sentences = ["this is a test", "hello world"]
pretokenized_sents = [moses.tokenize(s) for s in sentences]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

tokenizer(
  text=pretokenized_sents, 
  padding="max_length", 
  is_split_into_words=True, 
  truncation=True, 
  return_tensors="pt"
)

[out]:

{'input_ids': tensor([[ 101, 2023, 2003,  ...,    0,    0,    0],
        [ 101, 7592, 2088,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

Note: the is_split_into_words argument is not for processing batches of sentences; it specifies that your inputs to the tokenizer are already pre-tokenized (split into words).

alvas answered Apr 23 '26 04:04


