How to fine-tune BERT on unlabeled data?

I want to fine-tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine-tune BERT? I am currently looking here.

My main objective is to get sentence embeddings using BERT.

asked May 22 '20 by Rish

People also ask

Can BERT be used without fine-tuning?

Fine-tuning is not always necessary. Instead, the feature-based approach, where we simply extract pre-trained BERT embeddings as features, can be a viable and cheap alternative. However, it's important not to use only the final layer, but at least the last four layers, or all of them.
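For illustration, a rough sketch of that feature-based approach with Hugging Face's transformers library might look like the following; the checkpoint name and the averaging/pooling choices are assumptions, not part of the original answer:

```python
# Sketch of the feature-based approach: extract hidden states from the
# last 4 encoder layers without any fine-tuning (names are illustrative).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("A domain-specific sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of 13 tensors (embedding layer + 12 encoder layers),
# each of shape (batch, seq_len, 768). Average the last 4 layers, then
# mean-pool over tokens to get one fixed-size sentence vector.
last_four = torch.stack(outputs.hidden_states[-4:]).mean(dim=0)
sentence_embedding = last_four.mean(dim=1)   # shape: (1, 768)
```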


2 Answers

The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.

The former is simply a way of training BERT to adapt to a specific supervised task, for which you generally need on the order of 1,000 or more labeled samples.
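As a loose illustration of that supervised path (not part of the original answer), a minimal classification fine-tune with Hugging Face's Trainer could look like this; the texts, labels, and output directory are placeholders:

```python
# Hypothetical sketch of supervised fine-tuning with labeled examples.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

texts = ["great product", "terrible service"]   # placeholder labeled data
labels = [1, 0]
enc = tokenizer(texts, truncation=True, padding=True)

class TinyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-clf", num_train_epochs=1),
    train_dataset=TinyDataset(),
)
trainer.train()
```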

Pretraining, on the other hand, is basically about helping BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
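A minimal sketch of that continued (domain-adaptive) pretraining with Hugging Face's transformers, assuming your domain texts sit in a plain-text file with one passage per line; the file name and hyperparameters are illustrative:

```python
# Sketch of continued pretraining with the masked-language-modeling objective.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One sentence/paragraph per line in the plain-text domain corpus.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="domain_corpus.txt",   # placeholder for your own text files
    block_size=128,
)

# Randomly [MASK] 15% of the tokens, as in the original BERT objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-adapted",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("bert-domain-adapted")
```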

If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
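If all you need is to encode sentences, a minimal sketch with the sentence-transformers package looks like this; the checkpoint name is one of its published pre-trained models, chosen here only as an example:

```python
# Minimal sketch: one fixed-size embedding vector per sentence.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["BERT produces contextual token embeddings.",
             "Sentence Transformers pool them into one vector per sentence."]
embeddings = model.encode(sentences)
print(embeddings.shape)   # (2, 384) for this particular model
```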

Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!

answered Sep 27 '22 by dennlinger

@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation", and "language-modeling fine-tuning". Here you will find an example of how to do it.

But since you want good sentence embeddings, you are better off using Sentence Transformers. They also provide fine-tuned models that are already capable of understanding semantic similarity between sentences. The "Continue Training on Other Data" section is what you want for further fine-tuning the model on your domain. You do have to prepare a training dataset according to one of the available loss functions; e.g., ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
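As a rough sketch of that setup (the model checkpoint, example pairs, and hyperparameters below are placeholders, not taken from the linked documentation):

```python
# Sketch of continuing training on your own similar/dissimilar pairs with
# ContrastiveLoss; label=1 marks a similar pair, label=0 a dissimilar one.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [
    InputExample(texts=["query about topic X", "matching document"], label=1),
    InputExample(texts=["query about topic X", "unrelated document"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=10)
model.save("domain-tuned-sbert")
```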

answered Sep 27 '22 by pashok3ddd