 

Fastai - failed initiation of language model in Sentence Piece Processor, cache_dir parameter

I've already been browsing the web for hours to find a solution for my problem, which I believe might be a pretty petty issue.

I'm using fastai's Sentence Piece Processor (SPProcessor) at the very first steps of initialization of a language model.

My code for these steps looks like this:

from fastai.text import *   # fastai v1 text API: provides TextList and SPProcessor

bs = 48

processor = SPProcessor(lang='pl')

data_lm = (TextList.from_csv('', target_corpus, processor=processor)
            .split_by_rand_pct(0.1)
            .label_for_lm()           
            .databunch(bs=bs)
          )
data_lm.save(data_lm_file)

After execution I get an error, which is as follows:

~/x/miniconda3/envs/fastai/lib/python3.6/site-packages/fastai/text/data.py in process(self, ds)
    466             self.sp_model,self.sp_vocab = cache_dir/'spm.model',cache_dir/'spm.vocab'
    467         if not getattr(self, 'vocab', False):
--> 468             with open(self.sp_vocab, 'r', encoding=self.enc) as f: self.vocab = Vocab([line.split('\t')[0] for line in f.readlines()])
    469         if self.n_cpus <= 1: ds.items = self._encode_batch(ds.items)
    470         else:

FileNotFoundError: [Errno 2] No such file or directory: 'tmp/spm/spm.vocab'

The proper outcome of the code above should be as follows:
a folder named 'tmp' is created,
containing a folder 'spm',
within which two files should be placed, named spm.vocab and spm.model respectively.

What happens instead is that the 'tmp' folder is created, along with files
named "cache_dir".vocab and "cache_dir".model inside my current directory.
The 'spm' folder is nowhere to be found.

I've found a sort of workaround.
It consists of manually creating an 'spm' folder inside 'tmp', moving the two files
mentioned above into it, and renaming them to spm.vocab and spm.model.

That allows me to carry on with my processing, yet I'd like to find a way to avoid
having to move and rename the created files manually.
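For reference, here is a Python sketch of that manual workaround (assuming, as described above, that the misnamed files sit in the current directory with the quote characters literally in their filenames):

import shutil
from pathlib import Path

# create tmp/spm and move the misnamed files into it under their expected names
spm_dir = Path('tmp') / 'spm'
spm_dir.mkdir(parents=True, exist_ok=True)
shutil.move('"cache_dir".model', str(spm_dir / 'spm.model'))
shutil.move('"cache_dir".vocab', str(spm_dir / 'spm.vocab'))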

Maybe I need to pass some parameters (probably cache_dir) with specific values before processing?

If you have any idea how to solve this issue, please share it.
I'd be grateful.

asked Jan 17 '20 by Slonecznik

1 Answer

I can see a similar error if I switch the code in fastai/text/data.py to an earlier version of this commit. Then, if I apply the changes from the same commit, it all works nicely. However, the most recent version of the same file (the one which is supposed to help with paths containing spaces) seems to have yet another bug introduced.

So it seems that the problem is that fastai passes the --model_prefix argument with quotes to sentencepiece's SentencePieceTrainer.Train, which makes it misbehave.
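For illustration, here is a minimal sketch of why the quoting matters (paths here are just an example): the quotes become part of the argument value, so sentencepiece uses the literal string "cache_dir" as the file prefix, which matches the misnamed files described in the question.

from pathlib import Path

cache_dir = Path('tmp')

# buggy form: "cache_dir" sits inside the f-string as a quoted literal; nothing is
# interpolated, so sentencepiece writes "cache_dir".model and "cache_dir".vocab
# into the current working directory
buggy = f'--model_prefix="cache_dir"'
print(buggy)   # --model_prefix="cache_dir"

# fixed form: the actual path is interpolated into the argument
fixed = f"--model_prefix={cache_dir/'spm'}"
print(fixed)   # --model_prefix=tmp/spm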

One possibility for you would be either to (1) update to a later version of fastai (which might not help, due to another bug in a newer version), or (2) manually apply the changes from here to your installation's fastai/text/data.py. It's a very small change: just delete the line:

cache_dir = cache_dir/'spm'

and replace

f'--model_prefix="cache_dir" --vocab_size={vocab_sz} --model_type={model_type}']))

with:

f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))

In case you are not comfortable with changing the installed code, you can monkey-patch the module by writing a fixed version of the train_sentencepiece function in your own code and then doing something like fastai.text.data.train_sentencepiece = my_fixed_train_sentencepiece before any other calls.

So if you are using a newer version of the library, the code might look like this:

import os
from pathlib import Path
from typing import Collection

import fastai
from fastai.core import PathOrStr, ifnone, defaults  # ifnone and defaults are used in the body below
from fastai.text.data import ListRules, get_default_size, quotemark, full_char_coverage_langs

def train_sentencepiece(texts:Collection[str], path:PathOrStr, pre_rules: ListRules=None, post_rules:ListRules=None,
    vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en',
    char_coverage=None, tmp_dir='tmp', enc='utf8'):
    "Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
    from sentencepiece import SentencePieceTrainer
    cache_dir = Path(path)/tmp_dir
    os.makedirs(cache_dir, exist_ok=True)
    if vocab_sz is None: vocab_sz=get_default_size(texts, max_vocab_sz)
    raw_text_path = cache_dir / 'all_text.out'
    with open(raw_text_path, 'w', encoding=enc) as f: f.write("\n".join(texts))
    spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]
    SentencePieceTrainer.Train(" ".join([
        f"--input={quotemark}{raw_text_path}{quotemark} --max_sentence_length={max_sentence_len}",
        f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
        f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
        f"--user_defined_symbols={','.join(spec_tokens)}",
        f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
    raw_text_path.unlink()
    return cache_dir
        
fastai.text.data.train_sentencepiece = train_sentencepiece

And if you are using an older version, then like the following:

import os
from pathlib import Path
from typing import Collection

import fastai
from fastai.core import PathOrStr, ifnone, defaults  # ifnone and defaults are used in the body below
from fastai.text.data import ListRules, get_default_size, full_char_coverage_langs

def train_sentencepiece(texts:Collection[str], path:PathOrStr, pre_rules: ListRules=None, post_rules:ListRules=None, 
    vocab_sz:int=None, max_vocab_sz:int=30000, model_type:str='unigram', max_sentence_len:int=20480, lang='en',
    char_coverage=None, tmp_dir='tmp', enc='utf8'):
    "Train a sentencepiece tokenizer on `texts` and save it in `path/tmp_dir`"
    from sentencepiece import SentencePieceTrainer
    cache_dir = Path(path)/tmp_dir
    os.makedirs(cache_dir, exist_ok=True)
    if vocab_sz is None: vocab_sz=get_default_size(texts, max_vocab_sz)
    raw_text_path = cache_dir / 'all_text.out'
    with open(raw_text_path, 'w', encoding=enc) as f: f.write("\n".join(texts))
    spec_tokens = ['\u2581'+s for s in defaults.text_spec_tok]
    SentencePieceTrainer.Train(" ".join([
        f"--input={raw_text_path} --max_sentence_length={max_sentence_len}",
        f"--character_coverage={ifnone(char_coverage, 0.99999 if lang in full_char_coverage_langs else 0.9998)}",
        f"--unk_id={len(defaults.text_spec_tok)} --pad_id=-1 --bos_id=-1 --eos_id=-1",
        f"--user_defined_symbols={','.join(spec_tokens)}",
        f"--model_prefix={cache_dir/'spm'} --vocab_size={vocab_sz} --model_type={model_type}"]))
    raw_text_path.unlink()
    return cache_dir
        
fastai.text.data.train_sentencepiece = train_sentencepiece
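With the patch applied before the databunch is built, the pipeline from the question should produce the spm files in the right place without any manual moving. A rough usage sketch (target_corpus, data_lm_file and bs as defined in the question):

from fastai.text import *

fastai.text.data.train_sentencepiece = train_sentencepiece  # apply the fix first

processor = SPProcessor(lang='pl')
data_lm = (TextList.from_csv('', target_corpus, processor=processor)
            .split_by_rand_pct(0.1)
            .label_for_lm()
            .databunch(bs=bs))
data_lm.save(data_lm_file)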
answered Oct 17 '22 by Alexander Pivovarov