NLP

Question

I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.

Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.

Data: The data is in German and contains a lot of technical jargon.

I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided to apply or skip the following preprocessing steps:

no stopword removal
no lemmatization
replace all expressions with numbers by NUMBER
normalisation of synonyms and abbreviations
replace rare words with RARE
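
For illustration, the NUMBER and RARE steps could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the whitespace tokenizer, the rule "any token containing a digit counts as a number expression", the `min_count` threshold and all function names are assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    # Assumption: plain whitespace splitting; a real pipeline would use a German tokenizer.
    return text.split()

def replace_numbers(tokens):
    # Assumption: any token containing at least one digit is a number expression.
    return ["NUMBER" if any(ch.isdigit() for ch in t) else t for t in tokens]

def replace_rare(sentences, min_count=2):
    # Count tokens over the whole corpus, then mask those seen fewer than min_count times.
    counts = Counter(t for s in sentences for t in s)
    return [[t if counts[t] >= min_count else "RARE" for t in s] for s in sentences]

corpus = ["Der Druck betrug 3,5 bar", "Der Druck sank auf 2 bar"]
sentences = [replace_numbers(tokenize(line)) for line in corpus]
sentences = replace_rare(sentences, min_count=2)
```

Numbers are replaced before counting so that all numeric tokens share one frequent NUMBER token instead of each ending up as RARE.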

However, I am not sure whether to convert the corpus to lowercase. Searching the web, I found differing opinions. Although lowercasing is quite common, it would cause my model to predict the capitalization of nouns, sentence beginnings, etc. incorrectly.

I also found the idea to convert only the words at the beginning of a sentence to lower-case on the following Stanford page.
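
A rough sketch of how that idea might look for German, where a sentence-initial noun has to keep its capital: lowercase the first token only if its lowercase form is seen more often than its capitalized form in non-initial positions. The counting heuristic and the function name are assumptions for this example, not something taken from the Stanford page.

```python
from collections import Counter

def normalize_sentence_starts(sentences):
    # Count how each surface form appears in non-initial positions,
    # where the casing is not influenced by the sentence start.
    mid = Counter(t for s in sentences for t in s[1:])
    result = []
    for s in sentences:
        s = list(s)
        # Lowercase the first token only if the lowercase form dominates mid-sentence.
        if s and mid[s[0].lower()] > mid[s[0]]:
            s[0] = s[0].lower()
        result.append(s)
    return result

sentences = [
    ["Der", "Druck", "steigt"],
    ["Druck", "entsteht", "in", "der", "Leitung"],
    ["Dann", "sinkt", "der", "Druck"],
]
normalized = normalize_sentence_starts(sentences)
```

Here "Der" becomes "der", "Druck" stays capitalized because it is only ever seen as a noun, and "Dann" is left alone because the corpus gives no evidence either way.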

What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?

Thanks a lot for any suggestions and experiences!

Gambit1614 · Accepted Answer

I think for your particular use case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context. You probably won't need to predict sentence beginnings, and if a noun is predicted you can capitalize it afterwards. Consider the alternative (assuming your corpus is in English): your model might treat a word that is capitalized at the beginning of a sentence differently from the same word appearing later in the sentence without a capital letter. This might reduce accuracy, so I think lowercasing is the better trade-off. I did a project on a question-answering system, and converting the text to lowercase worked well there.

Edit: Since your corpus is in German, it would be better to retain the capitalization, since it is an important aspect of the German language.

If it is of any help, spaCy supports German. You can use it to train your model.

alvas · Answer

In general, tRuEcasIng helps. Truecasing is the process of restoring case information to badly-cased or noncased text.

See

How can I best determine the correct capitalization for a word?
https://github.com/nreimers/truecaser
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
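
To make the idea concrete, here is a minimal frequency-based sketch: learn the most common casing of each word from non-initial positions, then apply it to lowercased text. It is only an illustration under those assumptions and is not the algorithm used by the tools linked above; the function names are made up for the example.

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    # For every lowercased word, count the surface forms seen in
    # non-initial positions (sentence-initial casing is ambiguous).
    forms = defaultdict(Counter)
    for s in sentences:
        for t in s[1:]:
            forms[t.lower()][t] += 1
    # Keep the most frequent surface form per word.
    return {low: c.most_common(1)[0][0] for low, c in forms.items()}

def truecase(tokens, model):
    # Unknown words are left unchanged; the first token is capitalized afterwards.
    out = [model.get(t.lower(), t) for t in tokens]
    if out:
        out[0] = out[0][:1].upper() + out[0][1:]
    return out

training = [
    ["Der", "Druck", "im", "Ventil", "steigt"],
    ["Das", "Ventil", "regelt", "den", "Druck"],
]
model = train_truecaser(training)
restored = truecase(["der", "druck", "im", "ventil", "steigt"], model)
```

With a model like this, the language model itself can be trained on lowercased text and the casing restored in a separate step after prediction.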

Uri Goren · Answer

Definitely convert the majority of the words to lowercase, but consider the following cases:

Acronyms, e.g. MIT: if you lowercase it to mit, which is a word in German, you'll be in trouble
Initials, e.g. J. A. Snow
Enumerations, e.g. (I), (II), (III), APPENDIX A
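
One simple way to handle these cases, sketched under the assumption that the text is already tokenized: leave every token untouched whose cased characters are all uppercase, and lowercase the rest. The function name is hypothetical, and the rule is deliberately crude.

```python
def selective_lower(tokens):
    # str.isupper() is True when a token has at least one cased character and
    # all of them are uppercase, which covers "MIT", "J.", "(II)" and "A".
    return [t if t.isupper() else t.lower() for t in tokens]

tokens = ["Studied", "at", "MIT", "with", "J.", "A.", "Snow", "see", "(II)", "APPENDIX", "A"]
result = selective_lower(tokens)
```

The rule also preserves placeholder tokens such as NUMBER, but it cannot tell an acronym from a word that was merely typed in capitals for emphasis.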

I would also advise against the <RARE> token: what percentage of your corpus would become <RARE>, and how will you handle unknown words?

Since you are dealing with German, where words can be arbitrarily long and rare, you might need a way to break them down further. Some form of lemmatization and tokenization is therefore needed.
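
As a toy illustration of breaking long compounds into known parts, a recursive dictionary-based splitter might look like the sketch below. The vocabulary, the `min_len` parameter and the function name are assumptions for the example; linking elements such as the German Fugen-s are not handled, and the parts come back lowercased.

```python
def split_compound(word, vocab, min_len=3):
    """Return known parts covering `word`, or [word] if no full split is found."""

    def search(rest):
        if rest in vocab:
            return [rest]
        # Try longer heads first; head and tail must each have at least min_len characters.
        for i in range(len(rest) - min_len, min_len - 1, -1):
            head, tail = rest[:i], rest[i:]
            if head in vocab:
                parts = search(tail)
                if parts is not None:
                    return [head] + parts
        return None

    return search(word.lower()) or [word]

vocab = {"druck", "regel", "ventil"}  # hypothetical vocabulary of known word parts
parts = split_compound("Druckregelventil", vocab)
unknown = split_compound("Flansch", vocab)
```

A word that cannot be covered completely by known parts is returned unchanged, so it can still fall back to whatever unknown-word handling the model uses.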

I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you, Matthew and Ines).

NLP - When to lowercase text during preprocessing

Tags:

python

machine-learning

nltk

Lemon
