
spaCy Entity from PhraseMatcher only

Tags: nlp, spacy

I'm using spaCy for an NLP project. I have a list of phrases I'd like to mark as a new entity type. I originally tried training an NER model, but since the terminology list is finite, I think simply using a Matcher should be easier. I see in the documentation that you can add entities to a document based on a Matcher. My question is: how do I do this for a new entity type without the NER pipe labelling any other tokens as this entity? Ideally, only tokens found via my matcher should be marked as the entity. However, I need to add the new type as a label to the NER model, which then ends up labelling some other tokens as the entity too.

Any suggestions on how to best accomplish this? Thanks!

kevin.w.johnson asked Dec 10 '22 07:12


1 Answer

I think you might want to implement a custom pipeline component that uses the PhraseMatcher and assigns entities. spaCy's built-in entity recognizer is also just a pipeline component, so you can remove it from the pipeline and add your custom component instead:

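import spacy

# spaCy v2-style API: add_pipe takes the component object itself
nlp = spacy.load('en')               # load some model
nlp.remove_pipe('ner')               # remove the entity recognizer
# use your own entity matcher component (the class is defined below)
entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')
nlp.add_pipe(entity_matcher)         # add it to the pipeline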

Your entity matcher component could then look something like this:

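from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class EntityMatcher(object):
    name = 'entity_matcher'  # name shown in nlp.pipe_names

    def __init__(self, nlp, terms, label):
        # turn each term into a Doc so the PhraseMatcher can match its tokens
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            # the match ID is the hash of the label passed to matcher.add
            span = Span(doc, start, end, label=match_id)
            spans.append(span)
        # doc.ents cannot hold overlapping spans, so matches must not overlap
        doc.ents = spans
        return doc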

When your component is initialised, it creates match patterns for your terms, and adds them to the phrase matcher. My example assumes that you have a list of terms and a label you want to assign for those terms:

entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')
nlp.add_pipe(entity_matcher)

print(nlp.pipe_names)  # see all components in the pipeline

When you call nlp on a string of text, spaCy will tokenize the text to create a Doc object and call the individual pipeline components on the Doc in order. Your custom component's __call__ method then finds matches in the document, creates a Span for each of them (which allows you to assign a custom label), adds them to the doc.ents property and finally returns the Doc.
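
The same flow could be sketched roughly as follows under spaCy v3. The version is an assumption: the snippets above use the v2-style API, where add_pipe receives the component object, whereas v3 expects a registered component name. The factory name, the sample terms and the TECH label are illustrative only. filter_spans is used because doc.ents rejects overlapping spans, and a blank pipeline stands in for a loaded model.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans


class EntityMatcher:
    def __init__(self, nlp, terms, label):
        # make_doc only tokenizes, which is enough for exact phrase matching
        patterns = [nlp.make_doc(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, patterns)

    def __call__(self, doc):
        spans = [
            Span(doc, start, end, label=match_id)
            for match_id, start, end in self.matcher(doc)
        ]
        # keep the longest span wherever matches overlap
        doc.ents = filter_spans(spans)
        return doc


# Register a factory so the component can be added by name (illustrative name).
@Language.factory(
    "entity_matcher",
    default_config={"terms": [], "label": "SOME_LABEL"},
)
def create_entity_matcher(nlp, name, terms, label):
    return EntityMatcher(nlp, terms, label)


# A blank English pipeline has no 'ner' component, so nothing else sets entities.
nlp = spacy.blank("en")
nlp.add_pipe(
    "entity_matcher",
    config={"terms": ["machine learning", "deep learning"], "label": "TECH"},
)

doc = nlp("She studies machine learning and deep learning.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)  # [('machine learning', 'TECH'), ('deep learning', 'TECH')]
```

With a pretrained pipeline instead of a blank one, nlp.remove_pipe("ner") before adding the component would play the same role as in the answer above.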

You can structure your pipeline component however you like – for example, you could extend it to load in your terminology list from a file or make it add multiple rules for different labels to the PhraseMatcher.
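
As one possible way to do the file-loading part, here is a minimal sketch that groups terms by label. The tab-separated "LABEL<TAB>term" layout, the load_terms helper and the sample rows are all hypothetical, chosen only for illustration.

```python
import csv
import io
from collections import defaultdict


def load_terms(lines):
    """Group terms by label from tab-separated 'LABEL<TAB>term' rows."""
    terms_by_label = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        label, term = (cell.strip() for cell in row)
        if label and term:
            terms_by_label[label].append(term)
    return dict(terms_by_label)


# io.StringIO stands in for an open file here; with a real file, pass
# open(path, newline="", encoding="utf8") instead.
sample = io.StringIO(
    "DRUG\taspirin\n"
    "DRUG\tacetylsalicylic acid\n"
    "\n"
    "DISEASE\ttype 2 diabetes\n"
)
terms_by_label = load_terms(sample)
print(terms_by_label)
# {'DRUG': ['aspirin', 'acetylsalicylic acid'], 'DISEASE': ['type 2 diabetes']}
```

The component's __init__ could then loop over this dictionary and call self.matcher.add once per label, so that each match comes back with the ID of the label its term was registered under.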

Ines Montani answered Jan 03 '23 01:01