How exactly should the input file be formatted for language model fine-tuning (BERT through Huggingface Transformers)?

I want to use examples/run_lm_finetuning.py from the Huggingface Transformers repository on a pretrained BERT model. However, from the documentation it is not evident how a corpus file should be structured (apart from a reference to the WikiText-2 dataset). I've tried:

  • One document per line (multiple sentences)
  • One sentence per line, with documents separated by a blank line (this is what I found in some older pytorch-transformers documentation); a toy example is sketched below
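
For illustration, the second layout would look roughly like this (a toy corpus written out from Python; the file name and sentences are made up):

# Illustrative only: "one sentence per line, blank line between documents"
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    f.write("The first sentence of document one.\n"
            "The second sentence of document one.\n"
            "\n"
            "The only sentence of document two.\n")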

From looking at the code of examples/run_lm_finetuning.py it is not directly evident how the sequence pairs for the Next Sentence Prediction objective are formed. Would the --line_by_line option help here? I'd be grateful if someone could give me some hints on what a text corpus file should look like.

Many thanks and cheers,

nminds


1 Answer

First of all, I strongly suggest also opening this as an issue in the huggingface repository, as they probably have the strongest interest in answering it and may take it as a sign that they should update/clarify their documentation.

But to answer your question: this specific example script returns either a LineByLineTextDataset (if you pass --line_by_line to the training) or otherwise a TextDataset; see ll. 144-149 in the script (formatted slightly for better visibility):

def load_and_cache_examples(args, tokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
        # Each non-empty line of the input file becomes one training example
        return LineByLineTextDataset(tokenizer, args,
                           file_path=file_path, block_size=args.block_size)
    else:
        # The whole file is tokenized and cut into consecutive blocks of
        # block_size tokens, ignoring line breaks
        return TextDataset(tokenizer, args,
                           file_path=file_path, block_size=args.block_size)

A TextDataset simply splits the text into consecutive "blocks" of a certain (token) length, e.g., it will cut your text every 512 tokens (the default value).
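
Very roughly, that chunking amounts to the following (a simplified sketch; special-token handling and caching are omitted, and the corpus path is a placeholder):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
block_size = 512  # placeholder; controlled by --block_size in the script

with open("corpus.txt", encoding="utf-8") as f:  # placeholder path
    text = f.read()

# Tokenize the whole corpus as one long stream, then slice it into
# consecutive, non-overlapping blocks of block_size tokens
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
blocks = [token_ids[i:i + block_size]
          for i in range(0, len(token_ids) - block_size + 1, block_size)]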

The Next Sentence Prediction task is only implemented for the default BERT model, if I recall correctly (this seems consistent with what I found in the documentation), and is unfortunately not part of this specific fine-tuning script. None of the BERT model classes used in the lm_finetuning script make use of that particular task, as far as I can see.
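
If you do need NSP, it lives in the model classes that carry BERT's pre-training heads (e.g. BertForPreTraining), not in this script. A minimal sketch of feeding such a model one sentence pair (the sentences are made up, and this only shows the forward pass, not a labelled training loop):

from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# BERT's NSP input is a segment pair:
# [CLS] sentence A [SEP] sentence B [SEP], with token_type_ids marking A vs. B
encoding = tokenizer.encode_plus("The cat sat on the mat.",
                                 "It promptly fell asleep.",
                                 return_tensors="pt")
outputs = model(**encoding)
prediction_scores, seq_relationship_scores = outputs[:2]
# seq_relationship_scores: logits for "B follows A" vs. "B is a random sentence"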

dennlinger