
NLP - When to lowercase text during preprocessing

I want to build a model for language modelling, which should predict the next words in a sentence, given the previous word(s) and/or the previous sentence.

Use case: I want to automate writing reports. So the model should automatically complete the sentence I am writing. Therefore, it is important that nouns and the words at the beginning of a sentence are capitalized.

Data: The data is in German and contains a lot of technical jargon.

I am currently working on the preprocessing. Because my model should predict grammatically correct sentences, I have decided to use or skip the following preprocessing steps:

  • no stopword removal
  • no lemmatization
  • replace all expressions with numbers by NUMBER
  • normalisation of synonyms and abbreviations
  • replace rare words with RARE
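
The NUMBER and RARE replacements from the list can be sketched roughly as follows. This is only a minimal illustration: the regular expression, the function names and the `min_count` threshold are assumptions, not part of the original setup.

```python
import re
from collections import Counter

# Assumption: a "number expression" is a run of digits that may contain
# German-style separators, e.g. "1.250,5" or "20".
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def replace_numbers(text):
    """Replace every number expression in a raw string with NUMBER."""
    return NUMBER_RE.sub("NUMBER", text)


def replace_rare(tokenized_sentences, min_count=2):
    """Replace tokens seen fewer than min_count times with RARE."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return [
        [tok if counts[tok] >= min_count else "RARE" for tok in sent]
        for sent in tokenized_sentences
    ]


print(replace_numbers("Der Druck betrug 1.250,5 bar bei 20 Grad."))
print(replace_rare([["die", "Pumpe", "läuft"], ["die", "Pumpe", "steht"]]))
```

Counting on case-sensitive tokens, as done here, means "Pumpe" and "pumpe" are counted separately, so the casing decision also affects how many words end up as RARE.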

However, I am not sure whether to convert the corpus to lowercase. Searching the web, I found differing opinions. Lowercasing is quite common, but it would cause my model to predict the capitalization of nouns, sentence beginnings, etc. incorrectly.

I also found the idea to convert only the words at the beginning of a sentence to lower-case on the following Stanford page.
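
A minimal sketch of that sentence-initial idea, assuming the corpus is already split into sentences and tokens (the function name is hypothetical):

```python
def lowercase_sentence_starts(sentences):
    """Lowercase only the first character of the first token in each sentence."""
    result = []
    for sent in sentences:
        if not sent:
            result.append([])
            continue
        first = sent[0]
        # Only the initial letter is changed; the rest of the token is kept.
        result.append([first[:1].lower() + first[1:]] + list(sent[1:]))
    return result


print(lowercase_sentence_starts([["Die", "Pumpe", "läuft", "."]]))
```

In German this heuristic would also lowercase a noun that happens to start a sentence (for example "Ölstand prüfen."), so it probably needs an extra check, such as whether the lowercase form occurs elsewhere in the corpus.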

What is the best strategy for this use-case? Should I convert the text to lower-case and change the words to the correct case after prediction? Should I leave the capitalization as it is? Should I only lowercase words at the beginning of a sentence?

Thanks a lot for any suggestions and experiences!

Lemon asked Aug 24 '17 07:08


3 Answers

I think that for your particular use case it would be better to convert the text to lowercase, because ultimately you need to predict words given a certain context. You probably won't need to predict sentence beginnings, and if a noun is predicted you can capitalize it afterwards.

Consider the alternative (assuming your corpus is in English): your model might treat a capitalized word at the beginning of a sentence as different from the same word appearing later in the sentence without a capital letter. This could reduce accuracy, so I think lowercasing is the better trade-off. I worked on a question-answering system where converting the text to lowercase was a good trade-off.

Edit: Since your corpus is in German, it would be better to retain the capitalization, as it is an important aspect of the German language.

If it is of any help, spaCy supports German. You can use it to train your model.

Gambit1614 answered Oct 23 '22 02:10


In general, tRuEcasIng helps. Truecasing is the process of restoring case information to badly-cased or noncased text.

See

  • How can I best determine the correct capitalization for a word?
  • https://github.com/nreimers/truecaser
  • https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
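
To make the idea concrete, here is a minimal unigram truecasing sketch. It is not the method implemented by the tools linked above; it only assumes that the most frequent non-sentence-initial spelling of a word is its "true" case, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_truecaser(sentences):
    """Map each lowercased word to its most frequent surface form.

    Sentence-initial tokens are skipped because their capitalization
    says nothing about the word itself.
    """
    forms = defaultdict(Counter)
    for sent in sentences:
        for tok in sent[1:]:
            forms[tok.lower()][tok] += 1
    return {low: counter.most_common(1)[0][0] for low, counter in forms.items()}


def truecase(tokens, model):
    """Restore case with the learned forms, then capitalize the first token."""
    cased = [model.get(tok.lower(), tok) for tok in tokens]
    if cased:
        cased[0] = cased[0][:1].upper() + cased[0][1:]
    return cased


corpus = [["Die", "Pumpe", "läuft"], ["Wir", "prüfen", "die", "Pumpe"]]
model = train_truecaser(corpus)
print(truecase(["die", "pumpe", "läuft"], model))
```

With such a step, the language model could be trained on lowercased text and the predicted words recased afterwards; words never seen in training are left unchanged by this sketch.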

alvas answered Oct 23 '22 01:10


Definitely convert the majority of the words to lowercase, but consider the following cases:

  1. Acronyms, e.g. MIT: lowercasing it gives "mit", which is an ordinary German word ("with"), so the two become indistinguishable
  2. Initials, e.g. J. A. Snow
  3. Enumerations, e.g. (I), (II), (III), APPENDIX A
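
A rough sketch of lowercasing with these exceptions could look like the following. The patterns and function names are assumptions chosen for illustration, and they will not cover every case (a lone "A" as in "APPENDIX A" is still lowercased, for instance).

```python
import re

# Hypothetical patterns for the exceptions listed above.
INITIAL_RE = re.compile(r"[A-ZÄÖÜ]\.")              # initials such as "J."
ROMAN_ENUM_RE = re.compile(r"\(?[IVXLCDM]+\)?\.?")  # "I", "(II)", "III."


def should_keep_case(token):
    """Return True for tokens whose case should not be touched."""
    if INITIAL_RE.fullmatch(token):
        return True
    if len(token) > 1 and token.isupper():  # acronyms and all-caps words
        return True
    if ROMAN_ENUM_RE.fullmatch(token):
        return True
    return False


def selective_lower(tokens):
    """Lowercase every token except acronyms, initials and enumerations."""
    return [tok if should_keep_case(tok) else tok.lower() for tok in tokens]


print(selective_lower(["Das", "MIT", "und", "J.", "A.", "Snow", "(II)"]))
```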

I would also advise against the <RARE> token. What percentage of your corpus would become <RARE>, and what about unknown words?
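
One way to answer the percentage question is to measure it directly. The helper below is a hypothetical sketch; the `min_count` threshold is an assumption and should match whatever cutoff is used for the replacement.

```python
from collections import Counter


def rare_token_share(tokens, min_count=2):
    """Fraction of all token occurrences that would be replaced by <RARE>."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the occurrences of every word type below the threshold.
    rare_occurrences = sum(c for c in counts.values() if c < min_count)
    return rare_occurrences / len(tokens)


tokens = ["die", "Pumpe", "läuft", "die", "Pumpe", "steht"]
print(rare_token_share(tokens, min_count=2))
```

If this share turns out to be large, which is plausible for German compounds and technical jargon, the model would spend much of its capacity predicting a placeholder.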

Since you are dealing with German, where words can be arbitrarily long and rare, you might need a way to break them down further, so some form of lemmatization and tokenization is needed.

I recommend using spaCy, which has supported German from day one; the support and docs are very helpful (thank you, Matthew and Ines).

Uri Goren answered Oct 23 '22 01:10