 

AutoTokenizer.from_pretrained fails to load locally saved pretrained tokenizer (PyTorch)

I am new to PyTorch and recently, I have been trying to work with Transformers. I am using pretrained tokenizers provided by HuggingFace.
I am successful in downloading and running them. But if I try to save them and load again, then some error occurs.
If I use AutoTokenizer.from_pretrained to download a tokenizer, then it works.

[1]:    tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
        text = "Hello there"
        enc = tokenizer.encode_plus(text)
        enc.keys()

Out[1]: dict_keys(['input_ids', 'attention_mask'])

But if I save it using tokenizer.save_pretrained("distilroberta-tokenizer") and try to load it locally, then it fails.

[2]:    tmp = AutoTokenizer.from_pretrained('distilroberta-tokenizer')


---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    238                 resume_download=resume_download,
--> 239                 local_files_only=local_files_only,
    240             )

/opt/conda/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
    266         # File, but it doesn't exist.
--> 267         raise EnvironmentError("file {} not found".format(url_or_filename))
    268     else:

OSError: file distilroberta-tokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-25-3bd2f7a79271> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained("distilroberta-tokenizer")

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    193         config = kwargs.pop("config", None)
    194         if not isinstance(config, PretrainedConfig):
--> 195             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    196 
    197         if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    194 
    195         """
--> 196         config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    197 
    198         if "model_type" in config_dict:

/opt/conda/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
    250                 f"- or '{pretrained_model_name_or_path}' is the correct path to a directory containing a {CONFIG_NAME} file\n\n"
    251             )
--> 252             raise EnvironmentError(msg)
    253 
    254         except json.JSONDecodeError:

OSError: Can't load config for 'distilroberta-tokenizer'. Make sure that:

- 'distilroberta-tokenizer' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'distilroberta-tokenizer' is the correct path to a directory containing a config.json file

It's saying config.json is missing from the directory. On checking the directory, I get this list of files:

[3]:    !ls distilroberta-tokenizer

Out[3]: merges.txt  special_tokens_map.json  tokenizer_config.json  vocab.json

I know this problem has been posted before, but none of the suggested solutions seem to work. I have also tried to follow the docs, but still can't make it work.
Any help would be appreciated.

ferty567 asked Jan 25 '23

2 Answers

There is currently an issue under investigation which affects only the AutoTokenizer classes, not the underlying tokenizers such as RobertaTokenizer. For example, the following should work:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('YOURPATH')

To work with the AutoTokenizer, you also need to save the config so it can be loaded offline:

from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
config = AutoConfig.from_pretrained('distilroberta-base')

tokenizer.save_pretrained('YOURPATH')
config.save_pretrained('YOURPATH')

tokenizer = AutoTokenizer.from_pretrained('YOURPATH')

I recommend either using different paths for the tokenizer and the model, or keeping the model's config.json, because some modifications you apply to your model are stored in the config.json that is created during model.save_pretrained(). If you save the tokenizer's config to the same directory afterwards, as described above, that file will be overwritten and you won't be able to load your modified model with the tokenizer's config.json.
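The overwrite can be illustrated without downloading anything, using plain JSON files to stand in for the real configs (a hypothetical sketch; the file contents and the num_labels modification are made up for illustration):

```python
import json
import os
import tempfile

save_dir = tempfile.mkdtemp()
config_path = os.path.join(save_dir, "config.json")

# Step 1: simulate model.save_pretrained() writing a config.json
# that carries a modification you made to the model.
modified_config = {"model_type": "roberta", "num_labels": 5}
with open(config_path, "w") as f:
    json.dump(modified_config, f)

# Step 2: simulate config.save_pretrained() for the *unmodified*
# config writing to the same directory -- it clobbers the file above.
default_config = {"model_type": "roberta"}
with open(config_path, "w") as f:
    json.dump(default_config, f)

# The modification is now gone from config.json.
with open(config_path) as f:
    reloaded = json.load(f)
```

After step 2, reloaded no longer contains the num_labels entry, which is why saving the tokenizer's config into the model's directory loses the model's modifications.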

cronoik answered Jan 28 '23


I see several issues in your code, which I have listed below:

  1. distilroberta-tokenizer is a directory containing the vocab, config, and other files. Please make sure to create this directory first.

  2. AutoTokenizer works if this directory contains config.json and NOT tokenizer_config.json, so please rename that file.

I modified your code below and it works.

import os

dir_name = "distilroberta-tokenizer"

if not os.path.isdir(dir_name):
    os.mkdir(dir_name)

tokenizer.save_pretrained(dir_name)

# Rename the config file so AutoTokenizer can find it
os.rename(os.path.join(dir_name, "tokenizer_config.json"),
          os.path.join(dir_name, "config.json"))

tmp = AutoTokenizer.from_pretrained(dir_name)

I hope this helps!

Thanks!

user12769533 answered Jan 28 '23