python spacy looking for two (or more) words in a window

I am trying to identify concepts in texts. I often consider a concept to be present in a text when two or more words appear relatively close to each other. For instance, a concept would be any of the words forest, tree, nature at a distance of less than 4 words from fire, burn, overheat.

I am learning spaCy, and so far I can use the matcher like this:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],[{"LOWER": "hello"}, {"LOWER": "world"}])

That would match "hello world" and "hello, world" (or "tree firing" in the example mentioned above).

I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.

I had a look at https://spacy.io/usage/rule-based-matching

and the operators described there, but I am not able to express this word-window approach in spaCy syntax.

Furthermore, I am not able to generalize it to more words.

Any ideas? Thanks.

JFerro asked Apr 13 '26 01:04
1 Answer

For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your words. A wildcard matches any token, and in spaCy it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.

Thus, you can write your matcher as

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}])

which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for

doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

it will print

15578876784678163569 HelloWorld 0 4 Hello brave new world

If you want to match the other order (world ? ? ? hello) as well, you need to add a second, symmetric pattern to your matcher.
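To generalize this to groups of words, one option is to build both patterns programmatically. The sketch below is only an illustration: `window_patterns` is a hypothetical helper (not part of spaCy), and it relies on the `{"LOWER": {"IN": [...]}}` set syntax to let each end of the pattern match any word from a group. It only builds plain lists of dicts, so it does not need spaCy itself.

```python
def window_patterns(first_words, second_words, window):
    """Hypothetical helper: build two Matcher patterns, one per word order.

    Each pattern is: a word from one group, up to `window - 2` optional
    wildcard tokens, then a word from the other group.
    """
    if window < 2:
        raise ValueError("window must be at least 2 tokens")
    # One optional wildcard token per free slot between the two words
    gap = [{"OP": "?"} for _ in range(window - 2)]
    # "IN" lets a single token spec match any word of the group
    first = {"LOWER": {"IN": sorted(w.lower() for w in first_words)}}
    second = {"LOWER": {"IN": sorted(w.lower() for w in second_words)}}
    return [[first, *gap, second], [second, *gap, first]]


patterns = window_patterns(
    {"forest", "tree", "nature"},
    {"fire", "burn", "overheat"},
    window=5,  # assumed window size; adjust to your own definition
)

# Assumed usage with spaCy v3, where add() takes a list of patterns:
#     matcher.add("NatureFire", patterns)
# With the older call style used above, it would be:
#     matcher.add("NatureFire", None, *patterns)
```

Note that LOWER compares the lowercased token text exactly, so "burning" would not match "burn". If inflected forms should count, LEMMA could probably be used instead of LOWER, assuming the loaded pipeline provides lemmas.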

David Dale answered Apr 14 '26 17:04