What's the ideal way to include dictionaries (gazetteer) in spaCy to improve NER?

Tags: python, nlp, named-entity-recognition, spacy

I'm currently replacing a system based on NLTK entity extraction combined with regexp matching, for which I have several named entity dictionaries. The dictionaries cover both common entity types (e.g. PERSON, for employees) and custom types (e.g. SKILL). I want to use a pre-trained spaCy model and include my dictionaries somehow to increase NER accuracy. Here are my thoughts on possible methods:

1. Use spaCy's Matcher API, iterate through the dictionary and add each phrase with a callback to add the entity?
2. I've just found spacy-lookup, which seems like an easy way to provide long lists of words/phrases to match.
3. But what if I want to have fuzzy matching? Is there a way to add directly to the Vocab and thus have some fuzzy matching through Bloom filter / n-gram word vectors, or is there some extension out there that suits this need? Otherwise I guess I could copy spacy-lookup and replace the flashtext machinery with something else, e.g. Levenshtein distance.
4. While playing around with spaCy I did try just training the NER directly with a single word from the dictionary (without any sentence context), and this did "work". But I would, of course, have to take much care to keep the model from forgetting everything.

Any help appreciated, I feel like this must be a pretty common requirement and would love to hear what's working best for people out there.


asked Feb 14 '18 09:02

Einar Magnússon

2 Answers

I would recommend looking at spaCy's entity ruler. If you convert your existing dictionaries into the ruler's pattern format, you can add rules for each of your entities, including the new custom types.

This is quite powerful because you can combine it with the statistical NER that ships with a standard spaCy model to achieve some of the "fuzzy matching" you mention. From the docs:

The entity ruler is designed to integrate with spaCy’s existing statistical models and enhance the named entity recognizer. If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To overwrite overlapping entities, you can set overwrite_ents=True on initialization.
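As a rough illustration of converting a dictionary into ruler patterns, here is a minimal sketch. The `GAZETTEER` contents and the `build_patterns` helper are made up for the example, and it assumes spaCy v3, where the ruler is added by its string name. It uses a blank pipeline so it runs without a model download; with a pretrained pipeline you would load that instead and pass `before="ner"` to `add_pipe`.

```python
import spacy

# Hypothetical gazetteer; the entries and the SKILL label are illustrative only.
GAZETTEER = {
    "PERSON": ["Jane Doe", "John Smith"],
    "SKILL": ["machine learning", "Python"],
}


def build_patterns(nlp, gazetteer):
    """Turn {label: [phrase, ...]} into entity ruler patterns."""
    patterns = []
    for label, phrases in gazetteer.items():
        for phrase in phrases:
            # Tokenize with the pipeline's own tokenizer so patterns line up
            # with real documents; LOWER makes the match case-insensitive.
            tokens = [{"LOWER": tok.lower_} for tok in nlp.make_doc(phrase)]
            patterns.append({"label": label, "pattern": tokens})
    return patterns


# Blank pipeline: no statistical NER here, only the rules.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(build_patterns(nlp, GAZETTEER))

doc = nlp("Jane Doe has five years of machine learning experience.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

For very long dictionaries, plain string patterns (which the ruler hands to a phrase matcher internally) are probably the cheaper option; the token patterns above are used only because they make the case-insensitive match explicit.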


answered Oct 21 '22 22:10

pmbaumgartner

I use the Matcher with dynamically generated callbacks. I think it works well.
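Roughly, the idea looks like the sketch below. This is not my production code: the `GAZETTEER` entries and the `make_callback` helper are invented for the example, and it assumes the spaCy v3 `Matcher.add(key, patterns, on_match=...)` signature. One callback is generated per label, and each callback writes its match into `doc.ents`.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Hypothetical gazetteer; replace with your own dictionaries.
GAZETTEER = {
    "PERSON": ["John Smith"],
    "SKILL": ["Python", "machine learning"],
}


def make_callback(label):
    """Build an on_match callback that tags its match with `label`."""

    def on_match(matcher, doc, i, matches):
        _, start, end = matches[i]
        span = Span(doc, start, end, label=label)
        # filter_spans keeps the longest non-overlapping spans,
        # so assigning to doc.ents never fails on overlaps.
        doc.ents = filter_spans(list(doc.ents) + [span])

    return on_match


nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
for label, phrases in GAZETTEER.items():
    # One token pattern per phrase, matched case-insensitively.
    patterns = [
        [{"LOWER": tok.lower_} for tok in nlp.make_doc(phrase)]
        for phrase in phrases
    ]
    matcher.add(label, patterns, on_match=make_callback(label))

doc = nlp("John Smith knows Python and machine learning.")
matcher(doc)  # the callbacks fire while matching and fill doc.ents
matched_entities = [(ent.text, ent.label_) for ent in doc.ents]
print(matched_entities)
```

With a pretrained pipeline the same callbacks would merge dictionary hits with whatever the statistical NER already predicted; here `filter_spans` decides overlaps by span length, which may or may not be the priority rule you want.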

I got curious about why the Matcher doesn't support fuzzy matching, and found this comment by the author of spaCy on a closed issue:

You really want to precompute the search sets, rather than do them on-the-fly in the matcher. Once you've precomputed the similarity values, you can use extension attributes and a >= comparison in the Matcher to perform the search. I think this is a case where the implementation details strongly matter, and an API that obscures them would actually be a disservice.

I think this is a good point, and it tells you how to build what you want.
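A minimal sketch of that recipe, under my own assumptions: the similarity is `difflib`'s ratio from the standard library (standing in for Levenshtein or whatever you prefer), the gazetteer holds single words only, and the attribute name `skill_sim`, the helper names and the 0.8 threshold are arbitrary choices for the example. The scores are computed and cached before matching, so the Matcher only compares a stored number.

```python
from difflib import SequenceMatcher
from functools import lru_cache

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

# Hypothetical single-word gazetteer; multi-word terms would need span-level scoring.
SKILLS = ("python", "kubernetes", "docker")


@lru_cache(maxsize=None)
def best_similarity(word):
    """Highest difflib ratio between `word` and any gazetteer term (cached per word)."""
    return max(SequenceMatcher(None, word, term).ratio() for term in SKILLS)


# Extension attribute holding the precomputed score; the name is illustrative.
Token.set_extension("skill_sim", default=0.0, force=True)


def score_tokens(doc):
    """Store each token's best gazetteer similarity before the Matcher runs."""
    for token in doc:
        token._.skill_sim = best_similarity(token.lower_)
    return doc


nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# The pattern is just a >= comparison on the stored value.
matcher.add("SKILL", [[{"_": {"skill_sim": {">=": 0.8}}}]])

doc = score_tokens(nlp("She knows Pyhton and Kubernets."))
fuzzy_hits = [doc[start:end].text for _, start, end in matcher(doc)]
print(fuzzy_hits)
```

In a real pipeline `score_tokens` would more naturally be registered as a custom component, and for a large dictionary you would want a smarter index than a linear scan over every term.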

answered Oct 21 '22 22:10

Sam H.
