I am currently processing texts with the NLP library spaCy. spaCy, however, does not lemmatize all words correctly, so I want to extend its lookup table. At the moment I merge spaCy's constant lookup table with my extension and then overwrite spaCy's native lookup table with the result.
I have the feeling, however, that this approach may not be the best and most consistent one.
Question: Is there another way to update the lookup table in spaCy, e.g. an update or extend function? I have read the docs but could not find anything like that. Or is this approach "just fine"?
Working example of my current approach:
import spacy
nlp = spacy.load('de')
Spacy_lookup = spacy.lang.de.LOOKUP
New_lookup = {'AAA':'Anonyme Affen Allianz','BBB':'Berliner Bauern Bund','CCC':'Chaos Chaoten Club'}
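# merge the custom entries into spaCy's table, then overwrite the module-level constant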
Spacy_lookup.update(New_lookup)
spacy.lang.de.LOOKUP = Spacy_lookup
tagged = nlp("Die AAA besiegt die BBB und den CCC unverdient.")
[print(each.lemma_) for each in tagged]
Die
Anonyme Affen Allianz
besiegen
der
Berliner Bauern Bund
und
der
Chaos Chaoten Club
unverdient
.
Your solution seems fine.
However, a cleaner workaround would be to take advantage of spaCy's custom pipeline feature. Specifically, you can create a new component that updates the lemma attribute whenever a token's text appears in your custom lookup table, and then add that component to your pipeline.
Example code:
import spacy
custom_lookup = {'AAA':'Anonyme Affen Allianz','BBB':'Berliner Bauern Bund','CCC':'Chaos Chaoten Club'}
def change_lemma_property(doc):
    # overwrite the lemma of every token whose text appears in the custom table
    for token in doc:
        if token.text in custom_lookup:
            token.lemma_ = custom_lookup[token.text]
    return doc
nlp = spacy.load('de')
nlp.add_pipe(change_lemma_property, first=True)
text = 'Die AAA besiegt die BBB und den CCC unverdient.'
doc = nlp(text)
[print(x.lemma_) for x in doc]
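Note that passing a function to nlp.add_pipe with first=True only works in spaCy v2; from v3 onwards components are added by name and have to be registered first. A minimal sketch of the same idea for spaCy v3, assuming the de_core_news_sm model is installed (the component name custom_lemma_override is just an illustrative choice, and the component is added last so it runs after the built-in lemmatizer):
import spacy
from spacy.language import Language

custom_lookup = {'AAA': 'Anonyme Affen Allianz', 'BBB': 'Berliner Bauern Bund', 'CCC': 'Chaos Chaoten Club'}

@Language.component('custom_lemma_override')  # illustrative component name
def change_lemma_property(doc):
    # overwrite the lemma of every token whose text appears in the custom table
    for token in doc:
        if token.text in custom_lookup:
            token.lemma_ = custom_lookup[token.text]
    return doc

nlp = spacy.load('de_core_news_sm')
nlp.add_pipe('custom_lemma_override', last=True)  # run after the built-in lemmatizer

doc = nlp('Die AAA besiegt die BBB und den CCC unverdient.')
print([token.lemma_ for token in doc])
Either way, the custom lemmas replace the defaults, so AAA, BBB and CCC should come out as 'Anonyme Affen Allianz', 'Berliner Bauern Bund' and 'Chaos Chaoten Club', as in your output above.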