
How to add all standard special tokens to my hugging face tokenizer and model?

I want all special tokens to always be available. How do I do this?

My first attempt to give it to my tokenizer:

from transformers import AutoTokenizer, PreTrainedTokenizerFast


def does_t5_have_sep_token():
    tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained('t5-small')
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    print(tokenizer)
    print(f'{len(tokenizer)=}')
    # print(f'{tokenizer.all_special_tokens=}')
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')

    # Registers the strings as special tokens, but does NOT set named slots
    # like tokenizer.bos_token or tokenizer.sep_token.
    special_tokens_dict = {'additional_special_tokens': ['<bos>', '<cls>', '<s>'] + tokenizer.all_special_tokens}
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')



if __name__ == '__main__':
    does_t5_have_sep_token()
    print('Done\a')

but it feels hacky.

refs:

  • https://github.com/huggingface/tokenizers/issues/247
  • https://discuss.huggingface.co/t/how-to-add-all-standard-special-tokens-to-my-tokenizer-and-model/21529

seems useful: https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
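For reference, the pattern that doc describes is: grow the tokenizer's vocabulary first, then resize the model's embedding matrix so it has a row for every token. A minimal sketch (the token strings here are arbitrary choices, not anything T5 was trained with):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

num_added = tokenizer.add_special_tokens({'additional_special_tokens': ['<bos>', '<cls>']})
# New embedding rows are randomly initialized and only become useful after finetuning.
model.resize_token_embeddings(len(tokenizer))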


I want the standard special tokens to be registered under their proper names. The solution provided didn't work for me, since .bos_token is still None afterwards. See:

tokenizer.bos_token=None
tokenizer.cls_token=None
tokenizer.sep_token=None
tokenizer.mask_token=None
tokenizer.eos_token='</s>'
tokenizer.unk_token='<unk>'
tokenizer.bos_token_id=None
tokenizer.cls_token_id=None
tokenizer.sep_token_id=None
tokenizer.mask_token_id=None
tokenizer.eos_token_id=1
tokenizer.unk_token_id=2
tokenizer.all_special_tokens=['</s>', '<unk>', '<pad>', '<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_id_53>', '<extra_id_54>', '<extra_id_55>', '<extra_id_56>', '<extra_id_57>', '<extra_id_58>', '<extra_id_59>', '<extra_id_60>', '<extra_id_61>', '<extra_id_62>', '<extra_id_63>', '<extra_id_64>', '<extra_id_65>', '<extra_id_66>', '<extra_id_67>', '<extra_id_68>', '<extra_id_69>', '<extra_id_70>', '<extra_id_71>', '<extra_id_72>', '<extra_id_73>', '<extra_id_74>', '<extra_id_75>', '<extra_id_76>', '<extra_id_77>', '<extra_id_78>', '<extra_id_79>', '<extra_id_80>', '<extra_id_81>', '<extra_id_82>', '<extra_id_83>', '<extra_id_84>', '<extra_id_85>', '<extra_id_86>', '<extra_id_87>', '<extra_id_88>', '<extra_id_89>', '<extra_id_90>', '<extra_id_91>', '<extra_id_92>', '<extra_id_93>', '<extra_id_94>', '<extra_id_95>', '<extra_id_96>', '<extra_id_97>', '<extra_id_98>', '<extra_id_99>']
Using bos_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
Using mask_token, but it is not set yet.

code:

def does_t5_have_sep_token():
    """

    https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
    """
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, PreTrainedTokenizerFast

    tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained('t5-small')
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    print(tokenizer)
    print(f'{len(tokenizer)=}')

    print()
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')
    print()

    # special_tokens_dict = {'additional_special_tokens': ['<bos>', '<cls>', '<s>'] + tokenizer.all_special_tokens}
    # num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
    tokenizer.add_tokens([f"_{n}" for n in range(1, 100)], special_tokens=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    assert isinstance(model, torch.nn.Module)
    model.resize_token_embeddings(len(tokenizer))
    # tokenizer.save_pretrained('pathToExtendedTokenizer/')
    # tokenizer = T5Tokenizer.from_pretrained("sandbox/t5_models/pretrained/tokenizer/")

    print()
    print(f'{tokenizer.bos_token=}')
    print(f'{tokenizer.cls_token=}')
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.mask_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.unk_token=}')
    print(f'{tokenizer.bos_token_id=}')
    print(f'{tokenizer.cls_token_id=}')
    print(f'{tokenizer.sep_token_id=}')
    print(f'{tokenizer.mask_token_id=}')
    print(f'{tokenizer.eos_token_id=}')
    print(f'{tokenizer.unk_token_id=}')
    print(f'{tokenizer.all_special_tokens=}')
    print()



if __name__ == '__main__':
    does_t5_have_sep_token()
    print('Done\a')
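
Note: as far as I can tell, both 'additional_special_tokens' and add_tokens(..., special_tokens=True) register the strings as special tokens but never populate the named slots (bos_token, cls_token, sep_token, mask_token). Passing the named keys to add_special_tokens does populate them; a sketch, with token strings of my own choosing:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# Named keys set the corresponding attributes, unlike 'additional_special_tokens'.
tokenizer.add_special_tokens({
    'bos_token': '<s>',
    'cls_token': '<cls>',
    'sep_token': '<sep>',
    'mask_token': '<mask>',
})
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.bos_token)     # '<s>', no longer None
print(tokenizer.bos_token_id)  # a fresh id past the original vocab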
asked by Charlie Parker

1 Answer

I do not entirely understand what you're trying to accomplish, but here are some notes that might help:

T5 documentation shows that T5 has only three special tokens (</s>, <unk> and <pad>). You can also see this in the T5Tokenizer class definition. I am confident this is because the original T5 model was trained only with these special tokens (no BOS, no MASK, no CLS).

Running, e.g.,

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('t5-small')
print(tokenizer.all_special_tokens)

will show you these three tokens as well as the <extra_id_*> tokens.
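
You can also inspect which named slots are actually populated; for example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')
# Only eos/unk/pad are set; bos/cls/sep/mask are absent from the map.
print(tokenizer.special_tokens_map)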

Is there a reason you want the other tokens like BOS?

(Edit, to answer your comments: I really think you would benefit from reading the linked documentation at Hugging Face. The point of a pretrained model is to take advantage of what has already been done. T5 does not use BOS or CLS in the way you seem to be imagining. Maybe you can get it to work, but IMO it makes more sense to adapt the task you want to solve to the T5 approach.)

</s> plays the role of the sep token (it is T5's EOS token) and is already available.

As I understand it, for T5, masking in the sense of ignoring tokens in the loss is handled via attention_mask. If, on the other hand, you want to "fill in the blank", the <extra_id_*> tokens indicate to the model that it should predict the missing span (this is how the self-supervised pretraining is done). See the section on training in the documentation.
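
The "fill in the blank" setup from that training section looks roughly like this (adapted from the documentation's denoising example):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# Dropped-out spans in the input become <extra_id_*> sentinels; the target
# spells out what each sentinel stands for.
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
loss = model(input_ids=input_ids, labels=labels).loss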

BOS is similar: T5 is not trained to use a BOS token. E.g., again from the documentation:

Note that T5 uses the pad_token_id as the decoder_start_token_id, so when doing generation without using generate(), make sure you start it with the pad_token_id.
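
Concretely, if you decode step by step yourself, the first decoder input is the pad token (a sketch):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

input_ids = tokenizer('translate English to German: I love you.', return_tensors='pt').input_ids
# T5 starts decoding from pad_token_id rather than a BOS token.
assert model.config.decoder_start_token_id == tokenizer.pad_token_id
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).logits
first_token = logits[0, -1].argmax().item()  # greedy choice for the first output token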

T5 does not use a CLS token. If you want to do classification, you should fine-tune on a new task (or find a corresponding one from pretraining), training the model to generate a word (or words) that corresponds to the classes you want. Again from the documentation:

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format:

  • single sequence: X </s>
  • pair of sequences: A </s> B </s>
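
For instance, t5-small was pretrained on a task mixture that includes SST-2 sentiment, so classification is phrased as plain text generation (the prompt prefix below follows the T5 paper's convention; the exact prefix depends on the task mixture):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# The model generates a label word ('positive'/'negative') instead of
# relying on a CLS token.
input_ids = tokenizer('sst2 sentence: this movie was wonderful', return_tensors='pt').input_ids
outputs = model.generate(input_ids, max_new_tokens=3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))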

answered by jroz


