
How to use BERT pretrain embeddings with my own new dataset?

My dataset and NLP task are very different from the large corpora the authors pre-trained their model on (https://github.com/google-research/bert#pre-training-with-bert), so I can't directly fine-tune. Is there any example code or GitHub repository that can help me train BERT on my own data? I expect to get embeddings like GloVe.

Thank you very much!

asked Nov 23 '25 08:11 by BB8

1 Answer

Yes, you can get BERT embeddings, like other word embeddings, using the extract_features.py script. You can select the layers from which you want the output. Usage is simple: save one sentence per line in a text file and pass it as input. The output is a JSONL file with contextual embeddings for each token.
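For concreteness, here is a minimal sketch of how you might read that JSONL output back into per-token vectors. The file name output.jsonl and the choice of layer are assumptions for illustration; the field names follow the output format described in the repository's README (linked below).

```python
import json
import numpy as np

# First run the script from the BERT repo, roughly as documented in the
# README linked below, e.g.:
#   python extract_features.py --input_file=input.txt \
#     --output_file=output.jsonl --vocab_file=... --bert_config_file=... \
#     --init_checkpoint=... --layers=-1
#
# Each line of the resulting JSONL file corresponds to one input sentence.
token_vectors = []  # (token, vector) pairs for the first sentence
with open("output.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())
    for feature in record["features"]:
        token = feature["token"]
        # layers[0] is the first layer you requested via --layers
        # (e.g. -1 for the top layer); "values" is the embedding,
        # 768-dimensional for BERT-Base.
        vector = np.array(feature["layers"][0]["values"])
        token_vectors.append((token, vector))

for token, vector in token_vectors[:5]:
    print(token, vector.shape)
```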

The script's usage and documentation are available at: https://github.com/google-research/bert#using-bert-to-extract-fixed-feature-vectors-like-elmo
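One caveat, since you asked for GloVe-like vectors: BERT embeddings are contextual, so the same word gets a different vector in every sentence. If you want a single fixed vector per sentence, one common approach is to pool the token vectors yourself; the mean pooling below is an assumption of this sketch, not something the script does for you.

```python
import json
import numpy as np

def sentence_embeddings(jsonl_path, layer=0):
    """Mean-pool the per-token vectors into one vector per input line.

    `layer` indexes into the layers requested with --layers; mean
    pooling is an illustrative choice, not part of the BERT script.
    """
    embeddings = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            vectors = [
                np.array(feat["layers"][layer]["values"])
                for feat in record["features"]
            ]
            embeddings.append(np.mean(vectors, axis=0))
    return embeddings

# Example usage (assumes output.jsonl was produced as shown above):
# sentence_vecs = sentence_embeddings("output.jsonl")
```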

answered Nov 27 '25 15:11 by Ashwin Geet D'Sa


