
Detect stopword after lemmatization in spaCy

How to detect if word is a stopword after stemming and lemmatization in spaCy?

Assume the sentence:

s = "something good\nsomethings 2 bad"

In this case, something is a stopword. Obviously (to me?), Something and somethings are also stopwords, but they need to be stemmed or lemmatized first. The following script reports True for the first, but not for the other two.

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en')
tokenizer = Tokenizer(nlp.vocab)

s = "something good\nSomething 2 somethings"
tokens = tokenizer(s)

for token in tokens:
  print(token.lemma_, token.is_stop)

Returns:

something True
good False
"\n" False
Something False
2 False
somethings False

Is there a way to detect that through the spaCy API?

Dawid Laszuk asked Jan 29 '23 03:01


1 Answer

Stop words in spaCy are just a set of strings that set a flag on the lexemes, the context-independent entries in the vocabulary (the English stop list lives in spacy.lang.en.stop_words). The flag simply checks whether the text is in STOP_WORDS, which is why "something" returns True for is_stop and "somethings" doesn't.
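To make the exact-match behaviour concrete, here is a minimal sketch in plain Python. It does not call spaCy at all: TOY_STOP_WORDS and naive_is_stop are hypothetical stand-ins for the real stop list and the flag check, used only to illustrate the behaviour reported in the question.

```python
# Hypothetical stand-in for spaCy's stop list; the real English list is far larger.
TOY_STOP_WORDS = {"something", "the", "a"}


def naive_is_stop(text):
    """Exact string membership: no lowercasing and no lemmatization."""
    return text in TOY_STOP_WORDS


# Only the exact string "something" is in the set, so the other two forms miss.
flags = {word: naive_is_stop(word) for word in ["something", "Something", "somethings"]}
print(flags)
```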

However, what you can do is check if the token's lemma or lowercase form is part of the stop list, which is available via nlp.Defaults.stop_words (i.e. the defaults of the language you're using):

def extended_is_stop(token):
    stop_words = nlp.Defaults.stop_words
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words
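The fallback chain above can be checked without a model installed. The sketch below is an assumption-laden toy: FakeToken, TOY_STOP_WORDS and TOY_LEMMAS are hypothetical and only mimic the attribute names is_stop, lower_ and lemma_; a real spaCy pipeline supplies its own lemmas and a much larger stop list.

```python
from dataclasses import dataclass

# Hypothetical stand-ins: a tiny stop list and a tiny lemma lookup table.
TOY_STOP_WORDS = {"something"}
TOY_LEMMAS = {"somethings": "something"}


@dataclass
class FakeToken:
    """Mimics only the three token attributes the check relies on."""
    text: str

    @property
    def lower_(self):
        return self.text.lower()

    @property
    def lemma_(self):
        # Fall back to the lowercase text when the toy table has no entry.
        return TOY_LEMMAS.get(self.lower_, self.lower_)

    @property
    def is_stop(self):
        # Exact match only, mirroring the behaviour described in the question.
        return self.text in TOY_STOP_WORDS


def extended_is_stop(token, stop_words=TOY_STOP_WORDS):
    # Stop word if the raw text, the lowercase form, or the lemma is listed.
    return token.is_stop or token.lower_ in stop_words or token.lemma_ in stop_words


words = ["something", "Something", "somethings", "good"]
results = {w: extended_is_stop(FakeToken(w)) for w in words}
print(results)
```

With real spaCy tokens, whether the lemma branch fires depends on the pipeline actually assigning lemmas, so the document should come from a loaded model rather than from a bare tokenizer.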

If you're using spaCy v2.0 or later and want a more elegant solution, you can implement your own is_stop function as a custom Token attribute extension. You can choose any name for the attribute, and it becomes available under token._, for example token._.is_stop:

import spacy
from spacy.tokens import Token
from spacy.lang.en.stop_words import STOP_WORDS  # import stop words from language data

def stop_words_getter(token):
    # stop word if the token itself, its lowercase form, or its lemma is in the stop list
    return token.is_stop or token.lower_ in STOP_WORDS or token.lemma_ in STOP_WORDS

Token.set_extension('is_stop', getter=stop_words_getter)  # set attribute with getter

# 'en' is a v2 shortcut link; spaCy v3+ needs a package name such as 'en_core_web_sm'
nlp = spacy.load('en')
doc = nlp("something Something somethings")
assert doc[0]._.is_stop  # this was a stop word before, and still is
assert doc[1]._.is_stop  # this is now also a stop word, because its lowercase form is
assert doc[2]._.is_stop  # this is now also a stop word, because its lemma is

Ines Montani answered Jan 31 '23 20:01