I want to use spaCy's pretrained BERT model for text classification, but I'm a little confused about cased/uncased models. I read somewhere that cased models should only be used when there is a chance that letter casing will be helpful for the task. In my specific case, I am working with German texts, and in German all nouns start with a capital letter. So I think (correct me if I'm wrong) that this is exactly the situation where a cased model should be used. (There is also no uncased model available for German in spaCy.)
But what should be done with the data in this situation? While preprocessing the training data, should I leave the casing as it is (i.e. not call .lower()), or does it make no difference?
In BERT uncased, the text is lowercased before the WordPiece tokenization step, while in BERT cased, the text is the same as the input (no changes). For example, the input "OpenGenus" is converted to "opengenus" for BERT uncased, while BERT cased takes in "OpenGenus" unchanged.
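As a quick sketch of that difference, assuming the Hugging Face transformers library and its bert-base-uncased / bert-base-cased checkpoints (not necessarily the spaCy wrapper the question uses), you can compare the tokenizer output directly:

```python
from transformers import BertTokenizer

# The uncased tokenizer lowercases the text before WordPiece;
# the cased tokenizer leaves the input text untouched.
uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-cased")

text = "OpenGenus"
print(uncased.tokenize(text))  # word pieces of "opengenus"
print(cased.tokenize(text))    # word pieces of "OpenGenus", casing preserved
```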
BERT's architecture: there are currently two variants available. BERT Base has 12 layers (transformer blocks), 12 attention heads, and 110 million parameters. BERT Large has 24 layers (transformer blocks), 16 attention heads, and 340 million parameters.
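If you want to check those numbers yourself, one option (an assumption on my part, not something mentioned in the answer) is to inspect the model configs via Hugging Face transformers:

```python
from transformers import BertConfig

# Only the small config file is fetched here, not the full model weights.
for name in ["bert-base-uncased", "bert-large-uncased"]:
    cfg = BertConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, "layers,", cfg.num_attention_heads, "attention heads")
```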
As a non-German speaker, I find that your comment about nouns being capitalized does make it seem like case is more relevant for German than it might be for English, but that doesn't necessarily mean a cased model will give better performance on every task.
For something like part-of-speech detection, case would probably be enormously helpful for the reason you describe, but for something like sentiment analysis, it's less clear whether the added complexity of having a much larger vocabulary is worth the benefits. (As a human, you could probably imagine doing sentiment analysis with all lowercase text just as easily.)
Given that the only model available is the cased version, I would just go with that; I'm sure it will still be one of the best pretrained German models you can get your hands on. Cased models have separate vocab entries for differently-cased words (e.g. in English, the and The are different tokens). So yes, during preprocessing you wouldn't want to remove that information by calling .lower(); just leave the casing as-is.
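To illustrate that last point, here is a sketch assuming the Hugging Face bert-base-cased and bert-base-german-cased checkpoints (which may differ from the spaCy-wrapped model in the question):

```python
from transformers import BertTokenizer

cased = BertTokenizer.from_pretrained("bert-base-cased")

# "the" and "The" map to different entries in a cased vocabulary.
print(cased.convert_tokens_to_ids(["the", "The"]))  # two different ids

# German example: lowercasing the noun changes how it is tokenized,
# so the training data should keep its original casing.
german = BertTokenizer.from_pretrained("bert-base-german-cased")
print(german.tokenize("Der Hund bellt."))
print(german.tokenize("der hund bellt."))  # lowercased -> different (often longer) pieces
```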
In simple terms, BERT cased does not lowercase words that start with a capital letter, such as nouns in German.
BERT cased is also helpful where accents play an important role, for example schön in German. BERT uncased strips accents when it lowercases, so schön becomes schon, which has a different meaning: schön means beautiful, whereas schon means already.
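A small sketch of that behavior, again assuming the Hugging Face BertTokenizer (where accent stripping follows the do_lower_case setting by default):

```python
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-german-cased")

# The uncased basic tokenizer lowercases and strips accents,
# collapsing "schön" (beautiful) into "schon" (already).
print(uncased.basic_tokenizer.tokenize("schön"))  # ['schon']
print(cased.basic_tokenizer.tokenize("schön"))    # ['schön']
```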
The choice between BERT cased and BERT uncased can also depend on the context. For example, in dialogue systems users rarely type text with correct casing, so it is common to see words in lower case. In that situation, BERT uncased may have an advantage.