 

BERT classification on imbalanced or small dataset

I have a large unlabeled corpus, on which I trained my BERT tokenizer.

Now I want to build a BertModel for binary classification on a labeled dataset. However, this dataset is highly imbalanced (1:99). So my questions are:

  1. Would BertModel perform well on an imbalanced dataset?
  2. Would BertModel perform well on a small dataset? (As small as fewer than 500 data points; I bet it wouldn't.)
duoduolikes asked Feb 11 '26 11:02


1 Answer

The objective of transfer learning with pre-trained models partially answers your questions. BertModel is pre-trained on a large corpus, and when adapted to a task-specific corpus it usually performs better than models trained from scratch (for example, a simple LSTM trained directly for the classification task).

BERT has been shown to perform well when fine-tuned on a small task-specific corpus (this answers your question 2). However, the level of improvement also depends on the domain and task you want to perform, and on how related the pre-training data is to your target dataset.
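One practical caveat with a small, skewed dataset: before any fine-tuning, make sure both classes appear on each side of your train/validation split. With fewer than 500 points at 1:99, a plain random split can easily leave zero positives in validation. A minimal stratified-split sketch (`stratified_split` is a hypothetical helper, not part of any library):

```python
import random

def stratified_split(examples, labels, val_frac=0.2, seed=0):
    """Split per class so both train and validation contain every label.
    With ~500 points at 1:99, an unstratified split can drop all positives
    from one side."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    train, val = [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = max(1, int(len(xs) * val_frac))  # keep at least 1 example per class
        val += [(x, y) for x in xs[:cut]]
        train += [(x, y) for x in xs[cut:]]
    return train, val

# Hypothetical 1:99 dataset: 5 positives, 495 negatives.
labels = [1] * 5 + [0] * 495
examples = list(range(500))
train, val = stratified_split(examples, labels)
```

The split sets produced here would then be tokenized and fed to the fine-tuning loop; the splitting logic itself is independent of BERT.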

In my experience, pre-trained BERT fine-tuned on the target task performs much better than other DNNs such as LSTMs and CNNs when the dataset is highly imbalanced. However, this again depends on the task and data. 1:99 is a very large imbalance, which may require data-balancing techniques such as class weighting or resampling.
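One common balancing technique is to weight the loss inversely to class frequency (the "balanced" heuristic, weight_c = n_samples / (n_classes * count_c)); the resulting weights could, for instance, be passed to a weighted cross-entropy loss during fine-tuning. A minimal sketch, assuming a 1:99 toy distribution (`class_weights` is a hypothetical helper):

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical 1:99 label distribution: 5 positives, 495 negatives.
labels = [1] * 5 + [0] * 495
w = class_weights(labels)
# The minority class receives a 99x larger weight than the majority class,
# so each rare positive contributes as much to the loss as 99 negatives.
print(w[1] / w[0])  # → 99.0
```

With these weights, misclassifying a minority example costs as much as misclassifying 99 majority examples, which counteracts the model's tendency to predict the majority class everywhere.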

Ashwin Geet D'Sa answered Feb 15 '26 20:02


