
Why do we use return_tensors = "pt" during tokenization?

So I am tokenizing my dataset, and I created this function:

max_length = 1026

def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    return result

train_dataset = df_train['prompt']
val_dataset = df_test['prompt']
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = val_dataset.map(generate_and_tokenize_prompt)

Here you can see we are using return_tensors="pt", but I am not sure why we are using it, because even without this parameter I am able to tokenize my dataset.

MSY asked Aug 30 '25 17:08


2 Answers

"pt" means return pytorch tensor. See documentation https://huggingface.co/docs/transformers/main_classes/tokenizer

Dtoc answered Sep 03 '25 09:09


The return_tensors parameter determines the format in which the tokenized output is returned. This affects how the data can be used in subsequent steps, especially when preparing inputs for a model.

More often than not, setting return_tensors = "pt" means you will use the inputs for a forward pass through the model.

When to set return_tensors = "pt"

  • Model compatibility: when you are using a model that expects inputs in tensor form (PyTorch, TensorFlow, JAX models). This ensures that the data is in the correct, model-specific format (a sketch follows the output below).
  • Performance: tensor operations are generally faster than working with, say, NumPy arrays, especially once the tensors are moved to a GPU
inputs = tokenizer("Data Science is awesome", return_tensors='pt')
print(inputs)
>>{'input_ids': tensor([[  101,  2951,  2671,  2003, 12476,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])
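
To illustrate the model-compatibility point above, here is a minimal sketch, assuming a bert-base-uncased sequence-classification checkpoint (an illustrative choice, not the asker's model). Because the tokenizer already returned PyTorch tensors, the dict unpacks straight into the model's forward pass:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # illustrative checkpoint, not the asker's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("Data Science is awesome", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # tensors feed straight into the forward pass
print(outputs.logits.shape)
>> torch.Size([1, 2])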

When you can avoid setting return_tensors

  • Preprocessing and analysis: if you are still in the preprocessing stage and want to inspect the tokenized output or manipulate the data before converting it to tensors.
inputs = tokenizer("Data Science is awesome")
print(inputs)
>>{'input_ids': [101, 2951, 2671, 2003, 12476, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}
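
For instance, plain lists are easy to inspect during preprocessing; convert_ids_to_tokens is a standard tokenizer method, and the torch.tensor call at the end is just one way to convert later:

inputs = tokenizer("Data Science is awesome")

# Plain lists are easy to inspect and edit before tensor conversion
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"]))
>> ['[CLS]', 'data', 'science', 'is', 'awesome', '[SEP]']

# Convert to tensors later, once preprocessing is done
import torch
input_ids = torch.tensor([inputs["input_ids"]])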
ourendingdays answered Sep 03 '25 09:09