I have just followed this tutorial on how to train my own tokenizer.
Having trained the tokenizer, I wrapped it in a Transformers object so that I can use it with the transformers library:
from transformers import BertTokenizerFast
new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
Then, I try to save my tokenizer using this code:
tokenizer.save_pretrained('/content/drive/MyDrive/Tokenzier')
But I get this error:
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'save_pretrained'
Am I saving the tokenizer incorrectly?
If so, what is the correct way to save it to my local files so that I can use it later?
If you are building a custom tokenizer, you can save & load it like this:
from tokenizers import Tokenizer
# Save the raw tokenizers.Tokenizer to a single JSON file
tokenizer.save('saved_tokenizer.json')
# Load it back later
tokenizer = Tokenizer.from_file('saved_tokenizer.json')
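If you then want to use that saved file with the transformers library, a minimal sketch (the PreTrainedTokenizerFast wrapper and the output directory name saved_tokenizer_hf are my own example, not from the tutorial) is to wrap the JSON file in a fast tokenizer object, which does have save_pretrained():
from transformers import PreTrainedTokenizerFast
# Wrap the raw tokenizer file (path from the save step above) in a transformers object
wrapped_tokenizer = PreTrainedTokenizerFast(tokenizer_file='saved_tokenizer.json')
# You may also want to pass special tokens here, e.g. unk_token='[UNK]', pad_token='[PAD]'
# save_pretrained() now works, because this is a transformers tokenizer
wrapped_tokenizer.save_pretrained('saved_tokenizer_hf')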
save_pretrained() is only available on transformers tokenizer objects, not on a raw tokenizers.Tokenizer; it works, for example, when you retrain from a pre-trained tokenizer like this:
from transformers import AutoTokenizer
# Start from an existing pre-trained tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("the_pretrained_model_in_hf")
# Retrain it on your corpus with a vocabulary size of 52000
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), 52000)
# This is a transformers tokenizer, so save_pretrained() is available
tokenizer.save_pretrained("your-tokenizer")
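To use it later, you can reload it from that directory (a quick sketch; the sample sentence is just for illustration):
from transformers import AutoTokenizer
# Reload the tokenizer from the local directory saved above
tokenizer = AutoTokenizer.from_pretrained("your-tokenizer")
# Quick sanity check that it tokenizes text
print(tokenizer.tokenize("Hello world"))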