
How to change huggingface transformers default cache directory

The default cache directory does not have enough disk capacity, so I need to change the configuration of the default cache directory.

Ivan Lee asked Aug 08 '20 07:08



3 Answers

You can specify the cache directory every time you load a model with .from_pretrained by setting the cache_dir parameter. You can also define a default location by exporting the environment variable TRANSFORMERS_CACHE before you use the library (i.e. before importing it).

Example for Python:

import os
# Set this before importing transformers, otherwise it has no effect.
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'

Example for bash:

export TRANSFORMERS_CACHE=/blabla/cache/

cronoik answered Oct 24 '22 09:10


As @cronoik mentioned, as an alternative to changing the cache path in the terminal, you can set the cache directory directly in your code. Here is the actual code, in case you have any difficulty looking it up in the Hugging Face documentation:

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", cache_dir="new_cache_dir/")

model = AutoModelForMaskedLM.from_pretrained("roberta-base", cache_dir="new_cache_dir/")

aysljc answered Oct 24 '22 11:10


I'm writing this answer because there are other Hugging Face cache directories, besides the model cache, that also eat space in the home directory, and the previous answers and comments did not make this clear.

The Transformers documentation describes how the default cache directory is determined:

Cache setup

Pretrained models are downloaded and locally cached at: ~/.cache/huggingface/transformers/. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is given by C:\Users\username\.cache\huggingface\transformers. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:

  1. Shell environment variable (default): TRANSFORMERS_CACHE.
  2. Shell environment variable: HF_HOME + transformers/.
  3. Shell environment variable: XDG_CACHE_HOME + /huggingface/transformers.

What this piece of documentation doesn't explicitly mention is that HF_HOME defaults to $XDG_CACHE_HOME/huggingface and is used for other Hugging Face caches, e.g. the datasets cache, which is separate from the transformers cache. The value of XDG_CACHE_HOME is machine dependent, but usually it is $HOME/.cache (and Hugging Face falls back to this value if XDG_CACHE_HOME is not set), hence the usual default of $HOME/.cache/huggingface.
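
To make the lookup order concrete, here is a minimal sketch of the priority rules as described in the documentation excerpt quoted above. The function name resolve_transformers_cache is hypothetical and not part of the transformers API; it only mirrors the three rules plus the ~/.cache fallback, and the library's real behaviour may differ between versions.

```python
import os
from pathlib import Path


def resolve_transformers_cache(env=None):
    """Return the cache path implied by the priority order quoted above.

    `env` is a mapping of environment variables (defaults to os.environ).
    Illustrative helper only; this is NOT a function of the transformers library.
    """
    env = os.environ if env is None else env

    # 1. TRANSFORMERS_CACHE has the highest priority.
    if env.get("TRANSFORMERS_CACHE"):
        return Path(env["TRANSFORMERS_CACHE"])

    # 2. Otherwise use HF_HOME + transformers/.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "transformers"

    # 3. Otherwise use XDG_CACHE_HOME (falling back to ~/.cache)
    #    + huggingface/transformers.
    xdg = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(xdg) / "huggingface" / "transformers"
```

Calling resolve_transformers_cache() with no argument reads the current process environment, which is a quick way to check where the cache would end up before importing the library.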

So you probably will want to change the HF_HOME environment variable (and possibly set a symlink to catch cases where the environment variable is not set).

This environment variable is also respected by the Hugging Face datasets library, although the documentation does not explicitly state this.
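
As a rough sketch of the HF_HOME-plus-symlink approach suggested above, something like the following could work. The helper relocate_hf_home and the example path are hypothetical, and it assumes the old location does not already hold a cache (an existing one would have to be moved by hand first). On Windows, creating symlinks may require extra privileges.

```python
import os
from pathlib import Path


def relocate_hf_home(new_home, old_home=None):
    """Point HF_HOME at new_home and leave a symlink at the old default location.

    Hypothetical helper for illustration; call it before importing
    transformers or datasets.
    """
    new_home = Path(new_home)
    old_home = Path(old_home) if old_home else Path.home() / ".cache" / "huggingface"
    new_home.mkdir(parents=True, exist_ok=True)

    # Only create the symlink if nothing exists at the old location;
    # an existing cache directory is left untouched.
    if not old_home.exists() and not old_home.is_symlink():
        old_home.parent.mkdir(parents=True, exist_ok=True)
        old_home.symlink_to(new_home, target_is_directory=True)

    # Processes started from this one inherit the variable as well.
    os.environ["HF_HOME"] = str(new_home)
    return new_home
```

For example, relocate_hf_home("/blabla/huggingface") would set HF_HOME for the current process and, if ~/.cache/huggingface does not exist yet, link it to the new directory so that tools ignoring the variable still write to the larger disk.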

Bernhard Stadler answered Oct 24 '22 10:10