
How to add all standard special tokens to my hugging face tokenizer and model?

I want all special tokens to always be available. How do I do this?

My first attempt to give it to my tokenizer:

from transformers import AutoTokenizer, PreTrainedTokenizerFast


def does_t5_have_sep_token():
    tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained('t5-small')
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    print(tokenizer)
    print(f'{len(tokenizer)=}')
    # print(f'{tokenizer.all_special_tokens=}')
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')

    # Registers the strings as special tokens, but does NOT set named slots
    # like tokenizer.bos_token or tokenizer.sep_token.
    special_tokens_dict = {'additional_special_tokens': ['<bos>', '<cls>', '<s>'] + tokenizer.all_special_tokens}
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')



if __name__ == '__main__':
    does_t5_have_sep_token()
    print('Done\a')

but it feels hacky.

refs:

  • https://github.com/huggingface/tokenizers/issues/247
  • https://discuss.huggingface.co/t/how-to-add-all-standard-special-tokens-to-my-tokenizer-and-model/21529

seems useful: https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
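For reference, the pattern that doc describes is: grow the tokenizer's vocabulary first, then resize the model's embedding matrix so it has a row for every token. A minimal sketch (the token strings here are arbitrary choices, not anything T5 was trained with):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

num_added = tokenizer.add_special_tokens({'additional_special_tokens': ['<bos>', '<cls>']})
# New embedding rows are randomly initialized and only become useful after finetuning.
model.resize_token_embeddings(len(tokenizer))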


I want the standard special tokens to be registered under their proper names. The solution provided didn't work for me, since .bos_token is still None afterwards. See:

tokenizer.bos_token=None
tokenizer.cls_token=None
tokenizer.sep_token=None
tokenizer.mask_token=None
tokenizer.eos_token='</s>'
tokenizer.unk_token='<unk>'
tokenizer.bos_token_id=None
tokenizer.cls_token_id=None
tokenizer.sep_token_id=None
tokenizer.mask_token_id=None
tokenizer.eos_token_id=1
tokenizer.unk_token_id=2
tokenizer.all_special_tokens=['</s>', '<unk>', '<pad>', '<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_id_53>', '<extra_id_54>', '<extra_id_55>', '<extra_id_56>', '<extra_id_57>', '<extra_id_58>', '<extra_id_59>', '<extra_id_60>', '<extra_id_61>', '<extra_id_62>', '<extra_id_63>', '<extra_id_64>', '<extra_id_65>', '<extra_id_66>', '<extra_id_67>', '<extra_id_68>', '<extra_id_69>', '<extra_id_70>', '<extra_id_71>', '<extra_id_72>', '<extra_id_73>', '<extra_id_74>', '<extra_id_75>', '<extra_id_76>', '<extra_id_77>', '<extra_id_78>', '<extra_id_79>', '<extra_id_80>', '<extra_id_81>', '<extra_id_82>', '<extra_id_83>', '<extra_id_84>', '<extra_id_85>', '<extra_id_86>', '<extra_id_87>', '<extra_id_88>', '<extra_id_89>', '<extra_id_90>', '<extra_id_91>', '<extra_id_92>', '<extra_id_93>', '<extra_id_94>', '<extra_id_95>', '<extra_id_96>', '<extra_id_97>', '<extra_id_98>', '<extra_id_99>']
Using bos_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
Using mask_token, but it is not set yet.

code:

def does_t5_have_sep_token():
    """

    https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/model#transformers.PreTrainedModel.resize_token_embeddings
    """
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, PreTrainedTokenizerFast

    tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained('t5-small')
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    print(tokenizer)
    print(f'{len(tokenizer)=}')

    print()
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.all_special_tokens=}')
    print()

    # special_tokens_dict = {'additional_special_tokens': ['<bos>', '<cls>', '<s>'] + tokenizer.all_special_tokens}
    # num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
    tokenizer.add_tokens([f"_{n}" for n in range(1, 100)], special_tokens=True)
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    assert isinstance(model, torch.nn.Module)
    model.resize_token_embeddings(len(tokenizer))
    # tokenizer.save_pretrained('pathToExtendedTokenizer/')
    # tokenizer = T5Tokenizer.from_pretrained("sandbox/t5_models/pretrained/tokenizer/")

    print()
    print(f'{tokenizer.bos_token=}')
    print(f'{tokenizer.cls_token=}')
    print(f'{tokenizer.sep_token=}')
    print(f'{tokenizer.mask_token=}')
    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.unk_token=}')
    print(f'{tokenizer.bos_token_id=}')
    print(f'{tokenizer.cls_token_id=}')
    print(f'{tokenizer.sep_token_id=}')
    print(f'{tokenizer.mask_token_id=}')
    print(f'{tokenizer.eos_token_id=}')
    print(f'{tokenizer.unk_token_id=}')
    print(f'{tokenizer.all_special_tokens=}')
    print()



if __name__ == '__main__':
    does_t5_have_sep_token()
    print('Done\a')
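
Note: as far as I can tell, both 'additional_special_tokens' and add_tokens(..., special_tokens=True) register the strings as special tokens but never populate the named slots (bos_token, cls_token, sep_token, mask_token). Passing the named keys to add_special_tokens does populate them; a sketch, with token strings of my own choosing:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# Named keys set the corresponding attributes, unlike 'additional_special_tokens'.
tokenizer.add_special_tokens({
    'bos_token': '<s>',
    'cls_token': '<cls>',
    'sep_token': '<sep>',
    'mask_token': '<mask>',
})
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.bos_token)     # '<s>', no longer None
print(tokenizer.bos_token_id)  # a fresh id past the original vocab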
asked by Charlie Parker

1 Answer

I do not entirely understand what you're trying to accomplish, but here are some notes that might help:

T5 documentation shows that T5 has only three special tokens (</s>, <unk> and <pad>). You can also see this in the T5Tokenizer class definition. I am confident this is because the original T5 model was trained only with these special tokens (no BOS, no MASK, no CLS).

Running, e.g.,

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('t5-small')
print(tokenizer.all_special_tokens)

will show you these three tokens as well as the <extra_id_*> tokens.
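
You can also inspect which named slots are actually populated; for example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('t5-small')
# Only eos/unk/pad are set; bos/cls/sep/mask are absent from the map.
print(tokenizer.special_tokens_map)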

Is there a reason you want the other tokens like BOS?

(Edit, to answer your comments: I really think you would benefit from reading the linked documentation at Hugging Face. The point of a pretrained model is to take advantage of what has already been done. T5 does not use BOS or CLS in the way you seem to be imagining. Maybe you can get it to work, but IMO it makes more sense to adapt the task you want to solve to the T5 approach.)

</s> plays the role of the sep token (it is T5's EOS token) and is already available.

As I understand it, for T5, masking in the sense of ignoring tokens in the loss is handled via attention_mask. If, on the other hand, you want to "fill in the blank", the <extra_id_*> tokens indicate to the model that it should predict the missing span (this is how the self-supervised pretraining is done). See the section on training in the documentation.
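
The "fill in the blank" setup from that training section looks roughly like this (adapted from the documentation's denoising example):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# Dropped-out spans in the input become <extra_id_*> sentinels; the target
# spells out what each sentinel stands for.
input_ids = tokenizer('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt').input_ids
labels = tokenizer('<extra_id_0> cute dog <extra_id_1> the <extra_id_2>', return_tensors='pt').input_ids
loss = model(input_ids=input_ids, labels=labels).loss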

BOS is similar: T5 is not trained to use a BOS token. E.g., again from the documentation:

Note that T5 uses the pad_token_id as the decoder_start_token_id, so when doing generation without using generate(), make sure you start it with the pad_token_id.
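
Concretely, if you decode step by step yourself, the first decoder input is the pad token (a sketch):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

input_ids = tokenizer('translate English to German: I love you.', return_tensors='pt').input_ids
# T5 starts decoding from pad_token_id rather than a BOS token.
assert model.config.decoder_start_token_id == tokenizer.pad_token_id
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids).logits
first_token = logits[0, -1].argmax().item()  # greedy choice for the first output token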

T5 does not use a CLS token. If you want to do classification, you should fine-tune on a new task (or find a corresponding one from pretraining), training the model to generate a word (or words) that corresponds to the classes you want. Again from the documentation:

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format:

  • single sequence: X </s>
  • pair of sequences: A </s> B </s>
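
For instance, t5-small was pretrained on a task mixture that includes SST-2 sentiment, so classification is phrased as plain text generation (the prompt prefix below follows the T5 paper's convention; the exact prefix depends on the task mixture):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# The model generates a label word ('positive'/'negative') instead of
# relying on a CLS token.
input_ids = tokenizer('sst2 sentence: this movie was wonderful', return_tensors='pt').input_ids
outputs = model.generate(input_ids, max_new_tokens=3)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))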

answered by jroz


