I wanted to use the examples/run_lm_finetuning.py script from the Huggingface Transformers repository on a pretrained BERT model. However, it is not evident from the documentation how a corpus file should be structured (apart from the reference to the Wiki-2 dataset). I've tried a few layouts, but even from looking at the code of examples/run_lm_finetuning.py it is not directly evident how sequence pairs for the Next Sentence Prediction objective are formed. Would the --line_by_line option help here? I'd be grateful if someone could give me some hints on what a text corpus file should look like.
Many thanks and cheers,
nminds
First of all, I strongly suggest also opening this as an issue in the huggingface library, as they probably have the strongest interest in answering it, and may take it as a sign that they should update/clarify their documentation.
But to answer your question: this specific sample script basically returns either a LineByLineTextDataset (if you pass --line_by_line to the training) or, otherwise, a TextDataset; see ll. 144-149 in the script (formatted slightly for better readability):
def load_and_cache_examples(args, tokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    if args.line_by_line:
        return LineByLineTextDataset(tokenizer, args,
                                     file_path=file_path, block_size=args.block_size)
    else:
        return TextDataset(tokenizer, args,
                           file_path=file_path, block_size=args.block_size)
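In the --line_by_line case, LineByLineTextDataset treats every non-empty line of the file as one training example, so the corpus file should contain one sentence (or one short document) per line. Here is a minimal sketch of that behaviour (not the script's exact class), assuming a recent transformers version and a made-up corpus.txt:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
block_size = 512  # the default block size mentioned below

# hypothetical corpus.txt: one sentence or short document per line
with open("corpus.txt", encoding="utf-8") as f:
    lines = [line for line in f.read().splitlines() if line.strip()]

# every non-empty line becomes one example, truncated to block_size tokens
examples = [
    tokenizer.encode(line, add_special_tokens=True,
                     max_length=block_size, truncation=True)
    for line in lines
]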
A TextDataset simply splits the text into consecutive "blocks" of a certain (token) length, e.g., it will cut your text every 512 tokens (default value).
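In other words, without --line_by_line the file is treated as one long token stream, with no regard for line or document boundaries. A rough sketch of that chunking (again an approximation, not the script's exact class, assuming a recent transformers version and the same hypothetical corpus.txt):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
block_size = 512

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

# the whole file is tokenized as one continuous stream
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))

# leave room for the [CLS]/[SEP] tokens added to every block;
# a trailing remainder that does not fill a whole block is dropped
chunk = block_size - tokenizer.num_special_tokens_to_add(pair=False)
blocks = [
    tokenizer.build_inputs_with_special_tokens(token_ids[i:i + chunk])
    for i in range(0, len(token_ids) - chunk + 1, chunk)
]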
The Next Sentence Prediction task is only implemented for the default BERT model, if I recall correctly (this seems consistent with what I found in the documentation), and is unfortunately not part of this specific finetuning script. None of the BERT models used in the lm_finetuning script make use of that particular task, as far as I can see.
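If you do need the NSP objective, the model class that exposes both pretraining heads is BertForPreTraining, as opposed to the masked-LM-only setup this script uses; you would still have to build the sentence-pair examples and is-next/not-next labels yourself. A minimal sketch of querying both heads, assuming transformers >= 4 (the sentence pair is made up):

import torch
from transformers import BertForPreTraining, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# a sentence pair encoded the way NSP expects: [CLS] A [SEP] B [SEP]
inputs = tokenizer("The cat sat on the mat.", "It fell asleep there.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.prediction_logits.shape)        # masked-LM head: (1, seq_len, vocab_size)
print(outputs.seq_relationship_logits.shape)  # NSP head: (1, 2), "is next" vs. "not next"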