How exactly should the input file be formatted for language model fine-tuning (BERT through Huggingface Transformers)?

I want to use examples/run_lm_finetuning.py from the Huggingface Transformers repository on a pretrained BERT model. However, from the documentation it is not evident how a corpus file should be structured (apart from a reference to the WikiText-2 dataset). I've tried:

  • One document per line (multiple sentences)
  • One sentence per line, with documents separated by a blank line (this is what I found in some older pytorch-transformers documentation); a toy example is sketched below
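
For illustration, the second layout would look roughly like this (a toy corpus written out from Python; the file name and sentences are made up):

# Illustrative only: "one sentence per line, blank line between documents"
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    f.write("The first sentence of document one.\n"
            "The second sentence of document one.\n"
            "\n"
            "The only sentence of document two.\n")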

From looking at the code of examples/run_lm_finetuning.py it is not directly evident how the sequence pairs for the Next Sentence Prediction objective are formed. Would the --line_by_line option help here? I'd be grateful if someone could give me some hints on what a text corpus file should look like.

Many thanks and cheers,

nminds


1 Answer

First of all, I strongly suggest also opening this as an issue in the huggingface repository, as they probably have the strongest interest in answering it and may take it as a sign that they should update/clarify their documentation.

But to answer your question: this specific example script returns either a LineByLineTextDataset (if you pass --line_by_line to the training) or otherwise a TextDataset; see ll. 144-149 in the script (formatted slightly for better visibility):

def load_and_cache_examples(args, tokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
        # Each non-empty line of the input file becomes one training example
        return LineByLineTextDataset(tokenizer, args,
                           file_path=file_path, block_size=args.block_size)
    else:
        # The whole file is tokenized and cut into consecutive blocks of
        # block_size tokens, ignoring line breaks
        return TextDataset(tokenizer, args,
                           file_path=file_path, block_size=args.block_size)

A TextDataset simply splits the text into consecutive "blocks" of a certain (token) length, e.g., it will cut your text every 512 tokens (the default value).
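
Very roughly, that chunking amounts to the following (a simplified sketch; special-token handling and caching are omitted, and the corpus path is a placeholder):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
block_size = 512  # placeholder; controlled by --block_size in the script

with open("corpus.txt", encoding="utf-8") as f:  # placeholder path
    text = f.read()

# Tokenize the whole corpus as one long stream, then slice it into
# consecutive, non-overlapping blocks of block_size tokens
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
blocks = [token_ids[i:i + block_size]
          for i in range(0, len(token_ids) - block_size + 1, block_size)]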

The Next Sentence Prediction task is only implemented for the default BERT model, if I recall correctly (this seems consistent with what I found in the documentation), and is unfortunately not part of this specific fine-tuning script. None of the BERT model classes used in the lm_finetuning script make use of that particular task, as far as I can see.
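
If you do need NSP, it lives in the model classes that carry BERT's pre-training heads (e.g. BertForPreTraining), not in this script. A minimal sketch of feeding such a model one sentence pair (the sentences are made up, and this only shows the forward pass, not a labelled training loop):

from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# BERT's NSP input is a segment pair:
# [CLS] sentence A [SEP] sentence B [SEP], with token_type_ids marking A vs. B
encoding = tokenizer.encode_plus("The cat sat on the mat.",
                                 "It promptly fell asleep.",
                                 return_tensors="pt")
outputs = model(**encoding)
prediction_scores, seq_relationship_scores = outputs[:2]
# seq_relationship_scores: logits for "B follows A" vs. "B is a random sentence"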

dennlinger