 

How to search for (separable) phrases in text

Tags:

nlp

spacy

I'm looking for a way to search for a phrase or an idiomatic expression in a text, regardless of tense or possible prepositions / adverbs, e.g. if I'm looking for

call off
I would also like to find usages like
My boss called the meeting off.

Is this possible (using spacy)? If so, what feature or ability of NLP am I looking for?

dreo asked Oct 16 '25 18:10


2 Answers

Yes, you can do it with spaCy: you need a dependency parser to detect relations between words, and a lemmatizer to find the normal form (lemma) of each word. spaCy has both.

A dependency parser shows the syntactic relations between pairs of words, like here:

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('My boss called the meeting off.')
displacy.render(doc, style="dep", jupyter=True)


Idiomatic expressions tend to be represented by compact subtrees of such syntactic trees, characterized by specific relations between them. In different sentences the exact form and position of the words which are part of the idiom may vary, but the relation between them stays the same.
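To make this concrete without running a parser, here is a minimal sketch that uses hand-written annotations in place of real spaCy output. The `Tok` type, the helper `relation_triples`, and the lemma/dependency labels for both sentences are assumptions made for illustration; an actual parse may differ in details.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    text: str
    lemma: str
    dep: str
    head: int  # index of the head token; the root points to itself

def relation_triples(tokens):
    """Return the set of (head lemma, dependency, child lemma) triples."""
    return {
        (tokens[t.head].lemma, t.dep, t.lemma)
        for i, t in enumerate(tokens)
        if t.head != i  # skip the root, which has no head
    }

# Hand-annotated stand-ins for two parses (assumed labels, not parser output).
s1 = [
    Tok("My", "my", "poss", 1), Tok("boss", "boss", "nsubj", 2),
    Tok("called", "call", "ROOT", 2), Tok("the", "the", "det", 4),
    Tok("meeting", "meeting", "dobj", 2), Tok("off", "off", "prt", 2),
    Tok(".", ".", "punct", 2),
]
s2 = [
    Tok("They", "they", "nsubj", 2), Tok("are", "be", "aux", 2),
    Tok("calling", "call", "ROOT", 2), Tok("off", "off", "prt", 2),
    Tok("the", "the", "det", 5), Tok("strike", "strike", "dobj", 2),
    Tok(".", ".", "punct", 2),
]

# The word forms and positions differ, but the relation is shared.
shared = relation_triples(s1) & relation_triples(s2)
print(("call", "prt", "off") in shared)
```

The verb appears as "called" in one sentence and "calling" in the other, and the particle sits in a different position, yet both sentences contain the same triple.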

To search for an expression, we can loop over all the words in the document, looking for a word with the lemma "call" that has a connected ("child") word with the dependency "prt" (verb particle) and the lemma "off":

def detect_collocations(doc, parent_lemma, dep, child_lemma):
    """Create a generator of all occurrences of a collocation in a document.
    The elements of the generator are all pairs of tokens with lemmas `parent_lemma` and `child_lemma`
    and a dependency of type `dep` between them that are found in the spaCy document `doc`.
    """
    for token in doc:
        if token.lemma_ == parent_lemma:
            for child in token.children:
                if child.dep_ == dep and child.lemma_ == child_lemma:
                    yield token, child

result = list(detect_collocations(doc, 'call', 'prt', 'off'))
print(result)
# [(called, off)]

Because the function above returns pairs of spacy.Token objects, you can extract metadata from them, e.g. their positions, to highlight them in the text:

positions = {t.idx for pair in result for t in pair}
for token in doc:
    print('_{}_'.format(token) if token.idx in positions else token, end=' ')
# My boss _called_ the meeting _off_ . 
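The loop above prints tokens separated by spaces, which is why the final period ends up detached. If the original spacing matters, one option is to work with character offsets instead. The sketch below is pure Python; the helper name `highlight` is hypothetical, and the offsets are written out by hand as the values that `t.idx` and `t.idx + len(t)` would give for "called" and "off". It assumes the spans do not overlap.

```python
def highlight(text, spans, mark="_"):
    """Wrap each (start, end) character span of `text` in `mark`.

    Assumes the spans do not overlap.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])               # untouched text before the span
        out.append(mark + text[start:end] + mark)    # the highlighted span
        cursor = end
    out.append(text[cursor:])                        # remainder after the last span
    return "".join(out)

text = "My boss called the meeting off."
# With spaCy tokens this could be: [(t.idx, t.idx + len(t)) for pair in result for t in pair]
spans = [(27, 30), (8, 14)]
print(highlight(text, spans))
# My boss _called_ the meeting _off_.
```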

Here is a colab notebook you can play with.

David Dale answered Oct 19 '25 13:10


Here's @Sofie Vl's idea in code.

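Install the pre-release version of spaCy and the language model that works with it (since spaCy v3.0, the stable spacy package includes the same DependencyMatcher API, so `pip install spacy` should work as well):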

!pip install spacy-nightly
!python -m spacy download en_core_web_sm

At this point, you may need to restart the runtime.

import spacy
nlp = spacy.load("en_core_web_sm")
from spacy.matcher import DependencyMatcher

matcher = DependencyMatcher(nlp.vocab)

This is where the magic happens. LEMMA matches, well, lemmas, but there are other comparisons, such as `ORTH`, which requires an exact (orthographic) match, etc.

pattern = [
  {
    "RIGHT_ID": "call",
    "RIGHT_ATTRS": {"LEMMA": "call"}
  },
  {
    "LEFT_ID": "call",
    "REL_OP": ">",
    "RIGHT_ID": "off",
    "RIGHT_ATTRS": {"DEP": "prt", "LEMMA": "off"}
  }
]

Register the pattern, run it, and show the results:

matcher.add("called off", [pattern])

doc = nlp("There won't be any calling them off.")

result = matcher(doc)

# Each match is a (match_id, token_ids) pair; token_ids are token indices in doc.
positions = {i for match_id, token_ids in result for i in token_ids}
for token in doc:
    print('_{}_'.format(token) if token.i in positions else token, end=' ')

Result, as above:

# There wo n't be any _calling_ them _off_ . 
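The pattern above is specific to "call off", but the same two-node shape could be generated for other verb + particle pairs. The following is a minimal sketch under that assumption: the helper `particle_verb_pattern`, the node names "verb" and "particle", and the example list of phrasal verbs are all made up for illustration. It only builds the pattern data and does not need spaCy to run.

```python
def particle_verb_pattern(verb_lemma, particle_lemma):
    """Build a two-node dependency pattern: a verb with a `prt` child."""
    return [
        # Anchor node: any token whose lemma is the verb.
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": verb_lemma}},
        # Child node: a direct dependent of the verb, labelled as a particle.
        {
            "LEFT_ID": "verb",
            "REL_OP": ">",
            "RIGHT_ID": "particle",
            "RIGHT_ATTRS": {"DEP": "prt", "LEMMA": particle_lemma},
        },
    ]

# Hypothetical list of phrasal verbs to search for.
phrasal_verbs = [("call", "off"), ("give", "up"), ("put", "off")]
patterns = {f"{v} {p}": particle_verb_pattern(v, p) for v, p in phrasal_verbs}

for name in patterns:
    print(name)
```

Each entry could then be registered the same way as above, e.g. `matcher.add(name, [pattern])` for every `name, pattern` in `patterns.items()`, and the match id returned by the matcher can be turned back into the name with `nlp.vocab.strings[match_id]`.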
Matthias Winkelmann answered Oct 19 '25 14:10