I'm currently working on replacing a system based on nltk entity extraction combined with regexp matching, where I have several named entity dictionaries. The dictionary entities cover both common types (e.g. PERSON for employees) and custom types (e.g. SKILL). I want to use a pre-trained spaCy model and include my dictionaries somehow, to increase NER accuracy. Here are my thoughts on possible methods:
Should I use spaCy's Matcher API, iterating through the dictionary and adding each phrase with a callback that adds the entity?
I've just found spacy-lookup, which seems like an easy way to provide long lists of words/phrases to match.
But what if I want to have fuzzy matching? Is there a way to add directly to the Vocab and thus have some fuzzy matching through Bloom filter / n-gram word vectors, or is there some extension out there that suits this need? Otherwise I guess I could copy spacy-lookup and replace the flashtext machinery with something else, e.g. Levenshtein distance.
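To make the Levenshtein idea concrete, here is a minimal, stdlib-only sketch of fuzzy dictionary lookup. It uses `difflib.SequenceMatcher` (via `get_close_matches`) as a stand-in for a proper edit-distance library such as `rapidfuzz`; the `SKILLS` list, the `fuzzy_lookup` helper, and the `cutoff` value are all illustrative assumptions, not part of any existing package.

```python
import difflib

# Hypothetical skill dictionary; a real one would be much longer.
SKILLS = ["python", "machine learning", "kubernetes", "postgresql"]

def fuzzy_lookup(text, vocabulary, cutoff=0.85):
    """Return the best dictionary match for a string, or None.

    difflib's ratio is a stand-in for Levenshtein similarity;
    swap in rapidfuzz or python-Levenshtein for speed on long lists.
    """
    matches = difflib.get_close_matches(text.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_lookup("postgresql", SKILLS))  # exact hit → 'postgresql'
print(fuzzy_lookup("postgres", SKILLS))    # near miss → 'postgresql'
print(fuzzy_lookup("java", SKILLS))        # no match → None
```

A function like this could replace the flashtext machinery in a spacy-lookup-style component, at the cost of speed on large dictionaries.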
While playing around with spaCy I did try training the NER directly on single words from the dictionary (without any sentence context), and this did "work". But I would, of course, have to take great care to keep the model from forgetting everything else (the "catastrophic forgetting" problem).
Any help appreciated, I feel like this must be a pretty common requirement and would love to hear what's working best for people out there.
spaCy has its own deep learning library, Thinc, used under the hood for its NLP models. For most (if not all) tasks, spaCy uses a deep neural network based on a CNN with a few tweaks.
The recommended way to train your spaCy pipelines is via the spacy train command on the command line. It needs only a single config.cfg configuration file that includes all settings and hyperparameters.
I would recommend looking at spaCy's Entity Ruler. If you convert your existing dictionary into the schema for matching, you can add rules for each of your entities and new types.
This is quite powerful because you can combine it with the existing statistical NER available in a standard spaCy model to achieve some of the "fuzzy matching" you mention. From the docs:
The entity ruler is designed to integrate with spaCy’s existing statistical models and enhance the named entity recognizer. If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To overwrite overlapping entities, you can set overwrite_ents=True on initialization.
I use the Matcher with dynamically generated callbacks. I think it works well.
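This is not my exact code, but a sketch of the general idea, assuming dictionaries keyed by entity label; the `dictionaries` dict and `make_callback` factory are illustrative names, and the example uses a blank pipeline for self-containment.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # swap in a real model in practice
matcher = Matcher(nlp.vocab)

# Hypothetical dictionaries keyed by entity label.
dictionaries = {
    "SKILL": ["python", "machine learning"],
    "PERSON": ["alice", "bob"],
}

def make_callback(label):
    """Generate a callback that tags every match with the given label."""
    def add_entity(matcher, doc, i, matches):
        match_id, start, end = matches[i]
        span = Span(doc, start, end, label=label)
        # Assumes non-overlapping matches; use spacy.util.filter_spans
        # to resolve conflicts on messier dictionaries.
        doc.ents = list(doc.ents) + [span]
    return add_entity

for label, phrases in dictionaries.items():
    patterns = [[{"LOWER": tok} for tok in phrase.split()] for phrase in phrases]
    matcher.add(label, patterns, on_match=make_callback(label))

doc = nlp("alice knows python and machine learning")
matcher(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

For long phrase lists, the `PhraseMatcher` is usually a faster fit than token-by-token `Matcher` patterns.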
I got curious why the Matcher doesn't support fuzzy matching, and found this comment by spaCy's author on a closed issue:
You really want to precompute the search sets, rather than do them on-the-fly in the matcher. Once you've precomputed the similarity values, you can use extension attributes and a >= comparison in the Matcher to perform the search. I think this is a case where the implementation details strongly matter, and an API that obscures them would actually be a disservice.
I think this is a good point, and it tells you how to build what you want.
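Here is one way that suggestion might be sketched out: compute a similarity score per token, expose it as a custom extension attribute, and match on it with a `>=` comparison. The similarity function (stdlib `difflib` over a made-up `SKILLS` list) and the `skill_sim` attribute name are assumptions for illustration; a getter is used here for brevity, but for large corpora you would precompute and cache the scores, as the comment advises.

```python
import difflib

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token

nlp = spacy.blank("en")

# Hypothetical single-token skill dictionary.
SKILLS = ["python", "kubernetes"]

def best_similarity(token):
    """Best fuzzy score of a token against the dictionary.

    difflib stands in for whatever similarity you precompute
    (Levenshtein, n-gram vectors, ...).
    """
    return max(
        difflib.SequenceMatcher(None, token.lower_, s).ratio() for s in SKILLS
    )

# Lazily computed here; precompute and cache for real workloads.
Token.set_extension("skill_sim", getter=best_similarity)

matcher = Matcher(nlp.vocab)
# Match any token whose score clears a threshold.
matcher.add("FUZZY_SKILL", [[{"_": {"skill_sim": {">=": 0.85}}}]])

doc = nlp("deploying pyton apps on kuberntes")
print([doc[start:end].text for _, start, end in matcher(doc)])
# → ['pyton', 'kuberntes']
```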