
Lowercase lemmatization with spaCy in German

Tags: python, nlp, spacy

There seems to be a problem with noun singularization in spaCy for German. spaCy seems to rely on words being capitalized to recognize them as nouns. An example:

import spacy
nlp = spacy.load("C:\\Users\\somepath\\spacy\\de_core_md\\de_core_news_md\\de_core_news_md-2.2.5")

def lemmatize_text(text):
    """returns the text with each word in its basic form"""
    doc = nlp(text)
    return [word.lemma_ for word in doc]

lemmatize_text('Das Wort Tests wird erkannt. Allerdings werden tests nicht erkannt')
--> ['der', 'Wort', 'Test', 'werden', 'erkennen', '.', 'Allerdings', 'werden', 'tests', 'nicht', 'erkennen']

# should say 'Test' for both sentences

That would not be a problem if I were lemmatizing the original text right away. However, my preprocessing looks like this (a sketch of these steps follows the list):

  1. turn to lowercase
  2. remove punctuation
  3. remove stopwords
  4. lemmatize
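In code, that order is roughly the following (a minimal sketch; spaCy's built-in is_stop flag stands in for step 3, and the model path is shortened here):

import string
import spacy

nlp = spacy.load("de_core_news_md")  # shortened; loaded from a local path as above

def preprocess(text):
    """lowercase, strip punctuation, drop stopwords, then lemmatize"""
    text = text.lower()                                               # 1. turn to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    doc = nlp(text)
    return [tok.lemma_ for tok in doc if not tok.is_stop]             # 3. + 4. drop stopwords, lemmatize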

Is there a recommended order in which to execute the above steps?

I am not lemmatizing first because words at the beginning of a sentence are then not recognized correctly:

lemmatize_text('Größer wird es nicht mehr. größer wird es nicht mehr.')
--> ['Größer', 'werden', 'ich', 'nicht', 'mehr', '.', 'groß', 'werden', 'ich', 'nicht', 'mehr', '.']

# should say 'groß' for both sentences
Asked Sep 07 '25 by Justin P.

1 Answer

Old thread, but I hope this helps anyone still looking for answers...

On my content-word analysis project, I replace ! and ? with a full stop so that I can use the full stop as a split point for pulling entire sentences later on.

Then I replace the 6- and 9-shaped (curly) apostrophes ’ ‘ with the vertical ', as these sometimes creep into my text scans and affect tagging.

Then I use re.findall(r"[a-zA-ZÄÖÜäöüß]+'?[a-zA-ZÄÖÜäöüß]+\.?", s) to pull every word, including contractions and the last word of a sentence, into a list.

Then I join it all together with spaces to create a nice clean text.
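Put together, those cleanup steps look roughly like this (a sketch built from the replacements and the regex described above):

import re

def clean_text(s):
    # replace ! and ? with a full stop so sentences can be split on '.' later
    s = s.replace("!", ".").replace("?", ".")
    # replace the curly apostrophes ’ ‘ with the vertical one
    s = s.replace("\u2019", "'").replace("\u2018", "'")
    # pull words (including contractions and sentence-final words) into a list
    words = re.findall(r"[a-zA-ZÄÖÜäöüß]+'?[a-zA-ZÄÖÜäöüß]+\.?", s)
    # join back together with spaces for a clean text
    return " ".join(words)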

For tagging German, I use HanTa (the Hannover Tagger): https://github.com/wartaal/HanTa/blob/master/Demo.ipynb. It seems to be good at guessing upper and lower case even when both are entered incorrectly, as you can see in the example below.

Install:

 !pip install HanTa

Example:

from nltk import word_tokenize
from HanTa import HanoverTagger as ht

# load the pretrained German morphology model
tagger = ht.HanoverTagger('morphmodel_ger.pgz')

sent = "die bäume Wurden große. wie geht's dir, Meinem freund?"
tokenized_sentence = word_tokenize(sent)

# taglevel=1 returns (word, lemma, tag) triples
tokens = tagger.tag_sent(tokenized_sentence, taglevel=1)
for token in tokens:
    print(token[1] + ' ' + token[2])  # lemma and STTS tag

Output:

die ART
Baum NN
werden VAFIN
groß ADJA
-- $.
wie PWAV
gehen VVFIN
's PPER
dir PPER
-- $,
mein PPOSAT
Freund NN
-- $.

If it's just content words that you want, filter on the tag prefixes (see the sketch after this list):

  • Tags starting with 'NN' are nouns
  • Tags starting with 'V' are verbs
  • Tags starting with 'ADJ' are adjectives
  • Tags starting with 'ADV' are adverbs
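For example, a sketch of that filter on the tokens list from the example above (assuming, as in the loop there, that each token is a (word, lemma, tag) triple):

CONTENT_PREFIXES = ("NN", "V", "ADJ", "ADV")

# keep only the lemmas whose STTS tag starts with a content-word prefix
content_words = [lemma for word, lemma, tag in tokens
                 if tag.startswith(CONTENT_PREFIXES)]
print(content_words)
# from the tagged output above: ['Baum', 'werden', 'groß', 'gehen', 'Freund']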
Answered Sep 09 '25 by Asha_Tir