I am tokenizing my dataset and created this function:
max_length = 1026

def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    return result
train_dataset = df_train['prompt']
val_dataset = df_test['prompt']
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = val_dataset.map(generate_and_tokenize_prompt)
Here you can see we are using return_tensors="pt", but I am not sure why we are using it, because even without this parameter I am able to tokenize my dataset.
"pt" means return pytorch tensor. See documentation https://huggingface.co/docs/transformers/main_classes/tokenizer
The return_tensors parameter determines the format in which the tokenized output is returned. This affects how the data can be used in subsequent steps, especially when preparing inputs for a model.
More often than not, setting return_tensors="pt" means you will use the inputs for a forward pass through the model.
When to set return_tensors="pt"
inputs = tokenizer("Data Science is awesome", return_tensors='pt')
print(inputs)
>>{'input_ids': tensor([[ 101, 2951, 2671, 2003, 12476, 102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
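These tensors can go straight into the model's forward pass. A minimal sketch, assuming a bert-base-uncased checkpoint (the model choice is just for illustration):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# return_tensors='pt' gives batched PyTorch tensors, which is exactly
# the format the model's forward pass expects.
inputs = tokenizer("Data Science is awesome", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # works because inputs are already tensors

print(outputs.logits.shape)  # torch.Size([1, 2])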
When you can avoid setting return_tensors="pt"
inputs = tokenizer("Data Science is awesome")
print(inputs)
>>{'input_ids': [101, 2951, 2671, 2003, 12476, 102], 'attention_mask': [1, 1, 1, 1, 1, 1]}
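In your case, since you tokenize inside map for training, you can usually drop return_tensors entirely: datasets stores plain Python lists, and the Trainer's data collator turns each batch into tensors later. In fact, return_tensors="pt" on a single example produces tensors with an extra batch dimension of 1, which Dataset.map converts back to lists anyway when it writes to Arrow. A minimal sketch, assuming df_train, tokenizer, and max_length from your question, and using a Hugging Face datasets.Dataset (a pandas Series.map would just give you a column of encoding objects):

from datasets import Dataset

# Assumes df_train is a pandas DataFrame with a 'prompt' column,
# and tokenizer / max_length are defined as in the question.
train_dataset = Dataset.from_pandas(df_train[["prompt"]])

def generate_and_tokenize_prompt(example):
    # No return_tensors here: map stores plain lists, and the
    # data collator converts each batch to tensors during training.
    return tokenizer(
        example["prompt"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)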