I'm using spaCy for an NLP project. I have a list of phrases I'd like to mark as a new entity type. I originally tried training a NER model, but since the terminology list is finite, I think simply using a Matcher should be easier. I see in the documentation that you can add entities to a document based on a Matcher. My question is: how do I do this for a new entity type without having the NER pipe label any other tokens as that entity? Ideally only tokens found via my matcher should be marked as the entity, but since I need to add it as a label to the NER model, the model then ends up labeling some other tokens as the entity too.
Any suggestions on how to best accomplish this? Thanks!
I think you might want to implement something similar to this example – i.e. a custom pipeline component that uses the PhraseMatcher and assigns entities. spaCy's built-in entity recognizer is also just a pipeline component – so you can remove it from the pipeline and add your custom component instead:
nlp = spacy.load('en')                              # load some model
nlp.remove_pipe('ner')                              # remove the built-in entity recognizer
entity_matcher = EntityMatcher(nlp, terms, label)   # create your own entity matcher component
nlp.add_pipe(entity_matcher)                        # add it to the pipeline
Your entity matcher component could then look something like this:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        # create one Doc pattern per term and register them under the label
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for match_id, start, end in matches:
            # the match_id is the hash of the label the rule was added with
            span = Span(doc, start, end, label=match_id)
            spans.append(span)
        doc.ents = spans
        return doc
When your component is initialised, it creates match patterns for your terms and adds them to the phrase matcher. My example assumes that you have a list of terms and a label you want to assign for those terms:
entity_matcher = EntityMatcher(nlp, your_list_of_terms, 'SOME_LABEL')
nlp.add_pipe(entity_matcher)
print(nlp.pipe_names) # see all components in the pipeline
When you call nlp on a string of text, spaCy will tokenize the text to create a Doc object and call the individual pipeline components on the Doc in order. Your custom component's __call__ method then finds matches in the document, creates a Span for each of them (which allows you to assign a custom label), adds them to the doc.ents property and finally returns the Doc.
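If it helps, here is a minimal end-to-end sketch of that match → Span → doc.ents flow. It uses spacy.blank so no model download is needed, and the 'TECH' label and the example terms are made up for illustration (note it also uses the newer two-argument PhraseMatcher.add signature available since spaCy v2.2):

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank('en')
matcher = PhraseMatcher(nlp.vocab)
# newer add() signature (spaCy v2.2+): patterns passed as a list
matcher.add('TECH', [nlp('machine learning'), nlp('neural network')])

doc = nlp('I study machine learning and build a neural network.')
# each match is (match_id, start, end); the match_id doubles as the label hash
spans = [Span(doc, start, end, label=match_id)
         for match_id, start, end in matcher(doc)]
doc.ents = spans
print([(ent.text, ent.label_) for ent in doc.ents])
# [('machine learning', 'TECH'), ('neural network', 'TECH')]
```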
You can structure your pipeline component however you like – for example, you could extend it to load in your terminology list from a file, or make it add multiple rules for different labels to the PhraseMatcher.
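The multiple-labels variant could look roughly like this – the label names and terms here are invented, and the dict stands in for whatever file format you load your terminology from:

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

# hypothetical terminology, e.g. loaded from a JSON file
terms_by_label = {
    'FRUIT': ['apple', 'dragon fruit'],
    'VEGETABLE': ['sweet potato', 'squash'],
}

nlp = spacy.blank('en')
matcher = PhraseMatcher(nlp.vocab)
for label, terms in terms_by_label.items():
    # one rule per label; the match_id later tells us which label fired
    matcher.add(label, [nlp(term) for term in terms])

doc = nlp('She ate a dragon fruit and a sweet potato.')
doc.ents = [Span(doc, start, end, label=match_id)
            for match_id, start, end in matcher(doc)]
print([(ent.text, ent.label_) for ent in doc.ents])
# [('dragon fruit', 'FRUIT'), ('sweet potato', 'VEGETABLE')]
```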