There seems to be a problem with noun singularization with spaCy in German. spaCy seems to rely on words being capitalized to recognize them as nouns. An example:
import spacy
nlp = spacy.load("C:\\Users\\somepath\\spacy\\de_core_md\\de_core_news_md\\de_core_news_md-2.2.5")
def lemmatize_text(text):
    """returns the text with each word in its basic form"""
    doc = nlp(text)
    return [word.lemma_ for word in doc]
lemmatize_text('Das Wort Tests wird erkannt. Allerdings werden tests nicht erkannt')
--> ['der', 'Wort', 'Test', 'werden', 'erkennen', '.', 'Allerdings', 'werden', 'tests', 'nicht', 'erkennen']
# should say 'Test' for both sentences
That would not be a problem if I were lemmatizing the original text right away. However, my preprocessing looks like this:
Is there a recommended order in which to execute the above steps?
I am not lemmatizing first because words at the beginning of a sentence are then not recognized correctly:
lemmatize_text('Größer wird es nicht mehr. größer wird es nicht mehr.')
--> ['Größer', 'werden', 'ich', 'nicht', 'mehr', '.', 'groß', 'werden', 'ich', 'nicht', 'mehr', '.']
# should say 'groß' for both sentences
Old thread, but hope to help anyone still looking for answers...
On my content word analysis project, I replace ! and ? with a full stop, so that I can use the full stop as a split point for pulling entire sentences later on. Then I replace the 6- and 9-shaped apostrophes ’ ‘ with the vertical ', as these somehow creep into my text scans and affect tagging. Then I use re.findall(r"[a-zA-ZÄÖÜäöüß]+'?[a-zA-ZÄÖÜäöüß]+\.?", s) to pull every word, including contractions and the last word of a sentence, into a list. Then I join it all back together with spaces to create a nice clean text.
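In Python, those cleanup steps could look roughly like this (a minimal sketch; the function name clean_text and the sample sentence are my own, not from the original project):

import re

def clean_text(s):
    """Rough sketch of the cleanup described above."""
    # Turn ! and ? into full stops so sentences can later be split on '.'
    s = s.replace('!', '.').replace('?', '.')
    # Normalize the 6- and 9-shaped apostrophes to the vertical one
    s = s.replace('’', "'").replace('‘', "'")
    # Pull words, including contractions and sentence-final words
    words = re.findall(r"[a-zA-ZÄÖÜäöüß]+'?[a-zA-ZÄÖÜäöüß]+\.?", s)
    # Rejoin with spaces to get a clean text
    return ' '.join(words)

clean_text("Geht’s dir gut? Ja, sehr gut!")
# -> "Geht's dir gut. Ja sehr gut."  (note: the regex drops the comma)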
For tagging German, I use HanTa (the Hannover Tagger): https://github.com/wartaal/HanTa/blob/master/Demo.ipynb It seems to be good at guessing upper and lower case, even when both are entered incorrectly, as you can see in the example below.
Install:
!pip install HanTa
Example:
from nltk import word_tokenize
from HanTa import HanoverTagger as ht
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
sent = "die bäume Wurden große. wie geht's dir, Meinem freund?"
tokenized_sentence = word_tokenize(sent)
tokens = tagger.tag_sent(tokenized_sentence, taglevel=1)
for token in tokens:
    print(token[1] + ' / ' + token[2])
Output:
die / ART
Baum / NN
werden / VAFIN
groß / ADJA
-- / $.
wie / PWAV
gehen / VVFIN
's / PPER
dir / PPER
-- / $,
mein / PPOSAT
Freund / NN
-- / $.
If it's just content words that you want, you can filter the tagged tokens by their POS tags.
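For example, with the (word, lemma, tag) tuples returned by tag_sent above, something like this could work; treating the NN, VV, and ADJ tag prefixes as the content-word set is my assumption, not part of the original answer:

# Keep nouns, full verbs and adjectives; drop articles, pronouns, punctuation, ...
content_tags = ('NN', 'VV', 'ADJ')
content_words = [lemma for (word, lemma, tag) in tokens if tag.startswith(content_tags)]
print(content_words)
# -> ['Baum', 'groß', 'gehen', 'Freund'] for the example sentence above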