I want to use spaCy's pretrained BERT model for text classification, but I'm a little confused about cased/uncased models. I read somewhere that cased models should only be used when there is a chance that letter casing will be helpful for the task. In my specific case, I am working with German texts, and in German all nouns start with a capital letter. So I think (correct me if I'm wrong) that this is exactly the situation where a cased model should be used. (There is also no uncased model available for German in spaCy.)
But what should be done with the data in this situation? While preprocessing the training data, should I leave the casing as it is (i.e. not call .lower()), or does it make no difference?
In BERT uncased, the text is lowercased before the WordPiece tokenization step, while in BERT cased, the text is the same as the input (no changes). For example, the input "OpenGenus" is converted to "opengenus" for BERT uncased, while BERT cased takes in "OpenGenus" unchanged.
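As a quick sketch of that difference, assuming the Hugging Face transformers library and its bert-base-uncased / bert-base-cased checkpoints (not necessarily the spaCy wrapper the question uses), you can compare the tokenizer output directly:

```python
from transformers import BertTokenizer

# The uncased tokenizer lowercases the text before WordPiece;
# the cased tokenizer leaves the input text untouched.
uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-cased")

text = "OpenGenus"
print(uncased.tokenize(text))  # word pieces of "opengenus"
print(cased.tokenize(text))    # word pieces of "OpenGenus", casing preserved
```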
BERT's architecture: there are currently two variants available. BERT Base has 12 layers (transformer blocks), 12 attention heads, and 110 million parameters. BERT Large has 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
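If you want to check those numbers yourself, one option (an assumption on my part, not something mentioned in the answer) is to inspect the model configs via Hugging Face transformers:

```python
from transformers import BertConfig

# Only the small config file is fetched here, not the full model weights.
for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, "layers,", cfg.num_attention_heads, "attention heads")
```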
As a non-German speaker, I find that your comment about nouns being capitalized does make it seem like case is more relevant for German than it might be for English, but that doesn't necessarily mean a cased model will give better performance on every task.
For something like part-of-speech detection, case would probably be enormously helpful for the reason you describe, but for something like sentiment analysis, it's less clear whether the added complexity of having a much larger vocabulary is worth the benefits. (As a human, you could probably imagine doing sentiment analysis with all lowercase text just as easily.)
Given that the only model available is the cased version, I would just go with that; I'm sure it will still be one of the best pretrained German models you can get your hands on. Cased models have separate vocab entries for differently-cased words (e.g. in English, the and The are different tokens). So yes, during preprocessing you wouldn't want to remove that information by calling .lower(); just leave the casing as-is.
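To illustrate that last point, here is a sketch assuming the Hugging Face bert-base-cased and bert-base-german-cased checkpoints (which may differ from the spaCy-wrapped model in the question):

```python
from transformers import BertTokenizer

cased = BertTokenizer.from_pretrained("bert-base-cased")

# "the" and "The" map to different entries in a cased vocabulary.
print(cased.convert_tokens_to_ids(["the", "The"]))  # two different ids

# German example: lowercasing the noun changes how it is tokenized,
# so the training data should keep its original casing.
german = BertTokenizer.from_pretrained("bert-base-german-cased")
print(german.tokenize("Der Hund bellt."))
print(german.tokenize("der hund bellt."))  # lowercased -> different (often longer) pieces
```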
In simple terms, BERT cased does not lowercase words that start with a capital letter, such as nouns in German.
BERT cased is also helpful where accents play an important role, for example schön in German. BERT uncased strips accents when it lowercases, so schön becomes schon, which has a different meaning: schön means beautiful, whereas schon means already.
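A small sketch of that behavior, again assuming the Hugging Face BertTokenizer (where accent stripping follows the do_lower_case setting by default):

```python
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-german-cased")

# The uncased basic tokenizer lowercases and strips accents,
# collapsing "schön" (beautiful) into "schon" (already).
print(uncased.basic_tokenizer.tokenize("schön"))  # ['schon']
print(cased.basic_tokenizer.tokenize("schön"))    # ['schön']
```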
The choice between BERT cased and BERT uncased can also depend on the context. For example, in dialogue systems users rarely type text with correct casing, so it is common to see words in lower case. In that situation, BERT uncased may have an advantage.