BERT performing worse than word2vec

Question

I am trying to use BERT for a document ranking problem. The task is straightforward: rank documents by their similarity to an input document. The only issue is that I don’t have labels, so the evaluation is qualitative.

I plan to try several document representation techniques, mainly word2vec, para2vec and BERT.

For BERT, I came across the Hugging Face PyTorch library. I fine-tuned the bert-base-uncased model on around 150,000 documents for 5 epochs, with a batch size of 16 and a max sequence length of 128. However, when I compare the BERT representations against the word2vec representations, word2vec is currently performing better for me. For the BERT representation, I used the last four layers.
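For readers unsure what "using the last four layers" can look like, here is a minimal sketch, not the exact code used in the question. It assumes the per-layer hidden states have already been extracted as arrays of shape (seq_len, hidden_size); random arrays stand in for real model outputs, and the function name `combine_last_four_layers` is hypothetical. Averaging, summing and concatenating are three common ways to combine the layers.

```python
import numpy as np

def combine_last_four_layers(hidden_states, mode="mean"):
    """Combine the last four layers into one vector per token.

    hidden_states: sequence of arrays, one per layer, each of shape
    (seq_len, hidden_size). This layout is an assumption for the sketch.
    """
    last_four = np.stack(hidden_states[-4:], axis=0)  # (4, seq_len, hidden_size)
    if mode == "mean":
        return last_four.mean(axis=0)                 # (seq_len, hidden_size)
    if mode == "sum":
        return last_four.sum(axis=0)                  # (seq_len, hidden_size)
    if mode == "concat":
        # Layers are placed side by side: (seq_len, 4 * hidden_size)
        return np.concatenate(list(last_four), axis=-1)
    raise ValueError(f"unknown mode: {mode}")

# Stand-in data: 13 arrays (embedding output + 12 encoder layers),
# 6 tokens, hidden size 768, matching bert-base dimensions.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(6, 768)) for _ in range(13)]

token_vectors = combine_last_four_layers(hidden_states, mode="concat")
print(token_vectors.shape)
```

Note that concatenation quadruples the vector size, which matters if the vectors are later compared with cosine similarity or stored in an index.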

I am not sure why the fine-tuned model didn’t work. I read this paper, and this other link, which said that BERT performs well when fine-tuned for a classification task. However, since I don’t have labels, I fine-tuned it the way it's done in the paper, in an unsupervised manner.

Also, my documents vary a lot in length, so I’m feeding them in sentence by sentence right now. In the end I have to average over the word embeddings anyway to get the sentence embedding. Any ideas on a better method? I also read here that there are different ways of pooling over the word embeddings to get a fixed-size embedding. Is there a comparison of which pooling technique works better?
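To make the pooling options concrete, here is a small sketch of three common strategies (mean, max, and taking the first [CLS] position), followed by a length-weighted average of sentence vectors into a document vector. The function names and the toy numbers are hypothetical, and the attention mask is assumed to mark padding positions with 0 so that they are excluded from the pooled vector.

```python
import numpy as np

def pool_tokens(token_vectors, attention_mask, strategy="mean"):
    """Pool (seq_len, dim) token vectors into a single (dim,) vector."""
    mask = np.asarray(attention_mask, dtype=bool)
    real_tokens = token_vectors[mask]      # drop padding positions
    if strategy == "mean":
        return real_tokens.mean(axis=0)
    if strategy == "max":
        return real_tokens.max(axis=0)
    if strategy == "cls":
        return token_vectors[0]            # assumes [CLS] is the first token
    raise ValueError(f"unknown strategy: {strategy}")

def document_embedding(sentence_vectors, sentence_lengths=None):
    """Average sentence vectors, optionally weighted by sentence length."""
    return np.average(np.stack(sentence_vectors), axis=0, weights=sentence_lengths)

# Toy sentence: two real tokens and one padding token.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = [1, 1, 0]

mean_vec = pool_tokens(tokens, mask, "mean")   # [2.0, 3.0]
max_vec = pool_tokens(tokens, mask, "max")     # [3.0, 4.0]
cls_vec = pool_tokens(tokens, mask, "cls")     # [1.0, 2.0]

# Two sentence vectors, the first sentence three times as long as the second.
doc_vec = document_embedding([mean_vec, np.array([4.0, 5.0])], sentence_lengths=[3, 1])
print(mean_vec, max_vec, cls_vec, doc_vec)     # doc_vec is [2.5, 3.5]
```

Weighting by sentence length is only one option; an unweighted mean (leaving `sentence_lengths` as `None`) treats every sentence equally regardless of how long it is.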

Any help on training BERT better or a better pooling method will be greatly appreciated!

Henryk Borzymowski · Accepted Answer

You can check out this blog post:

BERT even has a special [CLS] token whose output embedding is used for classification tasks, but still turns out to be a poor embedding of the input sequence for other tasks. [Reimers & Gurevych, 2019]

Sentence-BERT, presented in [Reimers & Gurevych, 2019] and accompanied by a Python implementation, aims to adapt the BERT architecture by using siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity
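Once each document has a fixed-size embedding, ranking by cosine similarity is straightforward. The sketch below uses made-up 2-dimensional vectors; in practice the embeddings would come from a sentence-embedding model such as the Sentence-BERT implementation mentioned above. The function name `rank_by_cosine` is hypothetical, and the code assumes no embedding is the zero vector.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_matrix):
    """Return document indices sorted by descending cosine similarity.

    query_vec: (dim,) array; doc_matrix: (n_docs, dim) array.
    Assumes all vectors are non-zero.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)         # highest similarity first
    return order, scores[order]

# Made-up embeddings for illustration only.
query = np.array([1.0, 0.0])
docs = np.array([
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 1.0],   # 45 degrees from the query
    [2.0, 0.0],   # same direction as the query
])

order, scores = rank_by_cosine(query, docs)
print(order.tolist())   # [2, 1, 0]
```

Because the vectors are normalized first, the document `[2.0, 0.0]` scores 1.0 despite having a different length from the query; only direction matters.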

Tags:

machine-learning

deep-learning

unsupervised-learning

word2vec

bert-language-model

user3741951

1 Answer

Henryk Borzymowski
