There seems to be a problem with noun singularization with spaCy in German. spaCy seems to rely on words being capitalized to recognize them as nouns. An example:
import spacy
nlp = spacy.load("C:\\Users\\somepath\\spacy\\de_core_md\\de_core_news_md\\de_core_news_md-2.2.5")
def lemmatize_text(text):
    """returns the text with each word in its basic form"""
    doc = nlp(text)
    return [word.lemma_ for word in doc]
lemmatize_text('Das Wort Tests wird erkannt. Allerdings werden tests nicht erkannt')
--> ['der', 'Wort', 'Test', 'werden', 'erkennen', '.', 'Allerdings', 'werden', 'tests', 'nicht', 'erkennen']
# should say 'Test' for both sentences
That would not be a problem if I were lemmatizing the original text right away. However, my preprocessing looks like this:
Is there a recommended order in which to execute the above steps?
I am not lemmatizing first because words at the beginning of a sentence are then not recognized correctly:
lemmatize_text('Größer wird es nicht mehr. größer wird es nicht mehr.')
--> ['Größer', 'werden', 'ich', 'nicht', 'mehr', '.', 'groß', 'werden', 'ich', 'nicht', 'mehr', '.']
# should say 'groß' for both sentences
Old thread, but hope to help anyone still looking for answers...
On my content word analysis project, I replace ! and ? with a full stop, so that I can use the full stop as a split point for pulling entire sentences later on. Then I replace the 6- and 9-shaped apostrophes ’ ‘ with the vertical ', as these somehow creep into my text scans and affect tagging. Then I use re.findall(r"[a-zA-ZÄÖÜäöüß]+'?[a-zA-ZÄÖÜäöüß]+\.?", s) to pull every word, including contractions and the last word of a sentence, into a list. Then I join it all back together with spaces to create a nice clean text.
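In Python, those cleanup steps could look roughly like this (a minimal sketch; the function name clean_text and the sample sentence are my own, not from the original project):

import re

def clean_text(s):
    """Rough sketch of the cleanup described above."""
    # Turn ! and ? into full stops so sentences can later be split on '.'
    s = s.replace('!', '.').replace('?', '.')
    # Normalize the 6- and 9-shaped apostrophes to the vertical one
    s = s.replace('’', "'").replace('‘', "'")
    # Pull words, including contractions and sentence-final words
    words = re.findall(r"[a-zA-ZÄÖÜäöüß]+'?[a-zA-ZÄÖÜäöüß]+\.?", s)
    # Rejoin with spaces to get a clean text
    return ' '.join(words)

clean_text("Geht’s dir gut? Ja, sehr gut!")
# -> "Geht's dir gut. Ja sehr gut."  (note: the regex drops the comma)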
For tagging German, I use HanTa (the Hannover Tagger): https://github.com/wartaal/HanTa/blob/master/Demo.ipynb It seems to be good at guessing upper and lower case, even when both are entered incorrectly, as you can see in the example below.
Install:
!pip install HanTa
Example:
from nltk import word_tokenize
from HanTa import HanoverTagger as ht
tagger = ht.HanoverTagger('morphmodel_ger.pgz')
sent = "die bäume Wurden große. wie geht's dir, Meinem freund?"
tokenized_sentence = word_tokenize(sent)
tokens = tagger.tag_sent(tokenized_sentence, taglevel=1)
for token in tokens:
    print(token[1] + ' / ' + token[2])
Output:
die / ART
Baum / NN
werden / VAFIN
groß / ADJA
-- / $.
wie / PWAV
gehen / VVFIN
's / PPER
dir / PPER
-- / $,
mein / PPOSAT
Freund / NN
-- / $.
If it's just content words that you want, you can filter the tagged tokens by their POS tags.
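For example, with the (word, lemma, tag) tuples returned by tag_sent above, something like this could work; treating the NN, VV, and ADJ tag prefixes as the content-word set is my assumption, not part of the original answer:

# Keep nouns, full verbs and adjectives; drop articles, pronouns, punctuation, ...
content_tags = ('NN', 'VV', 'ADJ')
content_words = [lemma for (word, lemma, tag) in tokens if tag.startswith(content_tags)]
print(content_words)
# -> ['Baum', 'groß', 'gehen', 'Freund'] for the example sentence above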