How to search for (separable) phrases in text

Question

I'm looking for a way to search for a phrase or an idiomatic expression in a text, regardless of tense or possible prepositions / adverbs, e.g. if I'm looking for

call off

I would also like to find usages like

My boss called the meeting off.

Is this possible (using spacy)? If so, what feature or ability of NLP am I looking for?

David Dale · Accepted Answer

Yes, you can do it with spacy: you need a dependency parser to detect the relations between words, and a lemmatizer to find the normal form of each word. Spacy has both.

A dependency parser shows the syntactic relations between pairs of words, as in this example:

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp('My boss called the meeting off.')
displacy.render(doc, style="dep", jupyter=True)

Idiomatic expressions tend to be represented by compact subtrees of such syntactic trees, characterized by specific relations between them. In different sentences the exact form and position of the words which are part of the idiom may vary, but the relation between them stays the same.
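
To make this concrete, here is a minimal sketch that uses two hand-annotated parses (written by hand for illustration, not taken from parser output; the tuple layout and the `relation_triples` helper are hypothetical). It shows that the two sentences differ in word order and tense, yet share the same (head lemma, dependency, child lemma) relation for the idiom:

```python
# Hand-annotated parses (an assumption for illustration, not parser output).
# Each token is (text, lemma, dependency label, index of its head or None for the root).
# Punctuation is left out to keep the example short.
SENT_A = [
    ("My", "my", "poss", 1),
    ("boss", "boss", "nsubj", 2),
    ("called", "call", "ROOT", None),
    ("the", "the", "det", 4),
    ("meeting", "meeting", "dobj", 2),
    ("off", "off", "prt", 2),
]
SENT_B = [
    ("They", "they", "nsubj", 2),
    ("will", "will", "aux", 2),
    ("call", "call", "ROOT", None),
    ("off", "off", "prt", 2),
    ("the", "the", "det", 5),
    ("strike", "strike", "dobj", 2),
]

def relation_triples(sentence):
    """Return the set of (head lemma, dependency label, child lemma) triples."""
    triples = set()
    for _text, lemma, dep, head in sentence:
        if head is not None:  # the root has no head, so it contributes no triple
            triples.add((sentence[head][1], dep, lemma))
    return triples

# The only relation both sentences have in common is the idiom itself.
shared = relation_triples(SENT_A) & relation_triples(SENT_B)
print(shared)
# {('call', 'prt', 'off')}
```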

When we search for an expression, we can actually loop over all the words in the document, looking for a word with normal form "call" that has a connected ("child") word with dependency "prt" and normal form "off":

def detect_collocations(doc, parent_lemma, dep, child_lemma):
    """ Create a generator of all occurrences of a collocation in a document.
    The elements of generator are all pairs of tokens with lemmas `parent_lemma` and `child_lemma`
    and dependency of type `dep` between them that are found in a spacy document `doc`.
    """
    for token in doc:
        if token.lemma_ == parent_lemma:
            for child in token.children:
                if child.dep_ == dep and child.lemma_ == child_lemma:
                    yield token, child

result = list(detect_collocations(doc, 'call', 'prt', 'off'))
print(result)
# [(called, off)]
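
If you need to look for several phrasal verbs at once, one possible variation is to start from the particle tokens and look up their head in a set of phrases. This is only a sketch: `find_phrasal_verbs` and `PHRASES` are hypothetical names, it assumes the parser labels the particle with the dependency "prt" as in the example above, and the `SimpleNamespace` objects are stand-ins for spacy tokens so that the snippet runs without a model.

```python
from types import SimpleNamespace

def find_phrasal_verbs(tokens, phrases):
    """Yield (verb, particle) token pairs whose lemmas form a phrase in `phrases`.

    `tokens` is any iterable of objects exposing `lemma_`, `dep_` and `head`
    (spacy tokens do); `phrases` is a set of (verb lemma, particle lemma) pairs.
    """
    for token in tokens:
        # Only particles are inspected, so `head` is never read on the root.
        if token.dep_ == "prt" and (token.head.lemma_, token.lemma_) in phrases:
            yield token.head, token

# Stand-in tokens for "called the meeting off" (hand-built, not parser output).
called = SimpleNamespace(text="called", lemma_="call", dep_="ROOT", head=None)
meeting = SimpleNamespace(text="meeting", lemma_="meeting", dep_="dobj", head=called)
off = SimpleNamespace(text="off", lemma_="off", dep_="prt", head=called)

PHRASES = {("call", "off"), ("give", "up")}
pairs = [(verb.text, particle.text)
         for verb, particle in find_phrasal_verbs([called, meeting, off], PHRASES)]
print(pairs)
# [('called', 'off')]
```

With a real document you would pass `doc` itself as `tokens`.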

Because the function above returns pairs of spacy.Token objects, you can extract metadata from them, e.g. their positions, and use it to highlight them in the text:

positions = {t.idx for pair in result for t in pair}
for token in doc:
    print('_{}_'.format(token) if token.idx in positions else token, end=' ')
# My boss _called_ the meeting _off_ .
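
The loop above prints the tokens separated by spaces, so the original spacing is lost (note the space before the final period). A minimal sketch that highlights character spans in the original string instead is shown below; `highlight` is a hypothetical helper, it assumes the spans do not overlap, and the spans are hard-coded here so the snippet runs without spacy. With spacy tokens, each span would be `(t.idx, t.idx + len(t))`.

```python
def highlight(text, spans, marker="_"):
    """Wrap each (start, end) character span of `text` in `marker`.

    Assumes the spans do not overlap; they may be given in any order.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])                    # untouched text before the span
        pieces.append(marker + text[start:end] + marker)     # the highlighted span
        cursor = end
    pieces.append(text[cursor:])                             # untouched tail
    return "".join(pieces)

text = "My boss called the meeting off."
spans = [(27, 30), (8, 14)]  # character offsets of "off" and "called"
highlighted = highlight(text, spans)
print(highlighted)
# My boss _called_ the meeting _off_.
```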

Here is a colab notebook you can play with.

Matthias Winkelmann · Answer

Here's @Sofie Vl's idea in code.

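Install the pre-release version of spacy and the language model that works with it (the DependencyMatcher pattern format used below has been part of the regular spacy release since v3.0, so on a current setup `pip install spacy` should be enough):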

!pip install spacy-nightly
!python -m spacy download en_core_web_sm

At this point, you may need to restart the runtime.

import spacy
nlp = spacy.load("en_core_web_sm")
from spacy.matcher import DependencyMatcher

matcher = DependencyMatcher(nlp.vocab)

This is where the magic happens. `LEMMA` matches, well, lemmas, but there are other attributes you can compare on, such as `ORTH`, which requires an exact (orthographic) match.

pattern = [
  {
    "RIGHT_ID": "call",
    "RIGHT_ATTRS": {"LEMMA": "call"}
  },
  {
    "LEFT_ID": "call",
    "REL_OP": ">",
    "RIGHT_ID": "off",
    "RIGHT_ATTRS": {"DEP": "prt", "LEMMA": "off"}
  }
]

Register the pattern, run it and show the results:

matcher.add("called off", [pattern])

doc = nlp("There won't be any calling them off.")

result = matcher(doc)

# Each match is (match_id, token_ids); token_ids are token indices in pattern order.
positions = {i for match_id, token_ids in result for i in token_ids}
for token in doc:
    print('_{}_'.format(token) if token.i in positions else token, end=' ')

Result, as above:

# There wo n't be any _calling_ them _off_ .
