Cased vs. uncased BERT models in spaCy and training data

I want to use spaCy's pretrained BERT model for text classification, but I'm a little confused about cased/uncased models. I read somewhere that cased models should only be used when there is a chance that letter casing will be helpful for the task. In my specific case, I am working with German texts, and in German all nouns start with a capital letter. So I think (correct me if I'm wrong) that this is exactly the situation where a cased model must be used. (There is also no uncased BERT model available for German in spaCy.)

But what must be done with the data in this situation? Should I (while preprocessing the training data) leave it as it is (i.e. not apply the .lower() function), or does it make no difference?

asked May 19 '20 by Oleg Ivanytskyi


People also ask

What is the difference between cased and uncased BERT?

In BERT uncased, the text is lowercased before the WordPiece tokenization step, while in BERT cased, the text is the same as the input text (no changes). For example, if the input is "OpenGenus", it is converted to "opengenus" for BERT uncased, while BERT cased takes in "OpenGenus" as-is.
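To make that concrete, here is a minimal sketch using the Hugging Face transformers tokenizers (an assumption on my part; the question is about spaCy, but spaCy's pretrained BERT pipelines wrap the same kind of tokenizer). The exact subword splits depend on the vocabulary and may differ from the comments below.

# Sketch: how cased vs. uncased tokenization differs on "OpenGenus".
# Assumes the standard English checkpoints "bert-base-cased" / "bert-base-uncased".
from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("bert-base-cased")
uncased = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "OpenGenus"
print(cased.tokenize(text))    # casing preserved, e.g. ['Open', '##Gen', '##us']
print(uncased.tokenize(text))  # lowercased first, e.g. ['open', '##gen', '##us']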

What are the different BERT models?

BERT's architecture currently comes in two variants: BERT Base has 12 layers (transformer blocks), 12 attention heads, and 110 million parameters; BERT Large has 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
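If you want to check those numbers yourself, a small sketch (again assuming the Hugging Face transformers library, which is not part of the original question) reads them straight from the published configs:

# Sketch: read layer and attention-head counts from the published BERT configs.
from transformers import AutoConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, "layers,", cfg.num_attention_heads, "heads")
# Expected: 12 layers / 12 heads for base, 24 layers / 16 heads for large.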



3 Answers

As a non-German speaker, I'd say your comment about nouns being capitalized does make it seem like case is more relevant for German than it might be for English, but that doesn't necessarily mean a cased model will give better performance on all tasks.

For something like part-of-speech detection, case would probably be enormously helpful for the reason you describe, but for something like sentiment analysis, it's less clear whether the added complexity of having a much larger vocabulary is worth the benefits. (As a human, you could probably imagine doing sentiment analysis with all lowercase text just as easily.)

Given that the only model available is the cased version, I would just go with that - I'm sure it will still be one of the best pretrained German models you can get your hands on. Cased models have separate vocab entries for differently-cased words (e.g. in English, the and The are different tokens). So yes, during preprocessing you wouldn't want to remove that information by calling .lower(); just leave the casing as-is.
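As a hedged sketch of what "leave the casing as-is" means in practice, using the publicly available bert-base-german-cased checkpoint via Hugging Face transformers as a stand-in for whatever cased German model spaCy wraps for you:

# Sketch: preprocessing that keeps the original casing for a cased German model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")  # assumed stand-in

def preprocess(text: str) -> list:
    # Do NOT call text.lower() here: a cased vocabulary can distinguish,
    # e.g., "Essen" (the noun) from "essen" (the verb).
    return tokenizer.tokenize(text)

print(preprocess("Das Essen war sehr gut."))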

answered Oct 20 '22 by jayelm


In simple terms, BERT cased doesn't lowercase words that start with a capital letter, for example nouns in German.

BERT cased is also helpful where accents play an important role, for example schön in German.

If BERT uncased converts schön to schon, the meaning changes: schön means "beautiful", whereas schon means "already".
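A small sketch of this behaviour, assuming the Hugging Face transformers library: the uncased preprocessing lowercases and also strips accents by default, so schön and schon collapse to the same string, while the cased preprocessing leaves the umlaut alone.

# Sketch: BasicTokenizer is the pre-tokenization step used inside the slow BertTokenizer.
from transformers import BasicTokenizer

uncased_style = BasicTokenizer(do_lower_case=True)   # lowercases and strips accents by default
cased_style = BasicTokenizer(do_lower_case=False)

print(uncased_style.tokenize("schön"))  # ['schon']  -> now looks like "already"
print(cased_style.tokenize("schön"))    # ['schön']  -> "beautiful" stays distinct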

answered Oct 20 '22 by Shubhesh Swain


The difference between "BERT cased" and "BERT uncased" can to finded in different contexts. For example, in the dialogs system, the users rarely put the text in their correct form, so, is ordinary to find the words in lower case. Maybe, in this case, the BERT in uncased have an advantage.

answered Oct 20 '22 by M_Bueno