I want to use spaCy for Entity Linking (EL). I already trained a spaCy Named Entity Recognition (NER) model with custom labels on my domain-specific corpus. However, the following example uses the standard entity labels PERSON and LOCATION.
After aliases have been set in the Knowledge Base (KB), the KB returns candidates for occurrences of recognized entities. For example, candidates for "Paris" can be the Wikidata entries Q47899 (Paris Hilton), Q7137357 (Paris Themmen), Q5214166 (Dan Paris), Q90 (Paris, capital of France), or Q830149 (Paris, county seat of Lamar County, Texas, United States).
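For reference, here is a minimal sketch of how such a KB can be set up, assuming spaCy >= 3.5 (InMemoryLookupKB; in earlier 3.x versions the class is KnowledgeBase) and made-up frequencies, entity vectors, and prior probabilities:

```python
import spacy
from spacy.kb import InMemoryLookupKB  # spaCy >= 3.5; earlier 3.x: KnowledgeBase

nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Register the entities with dummy frequencies and entity vectors.
for qid, freq in [("Q90", 500), ("Q830149", 20), ("Q47899", 100),
                  ("Q7137357", 10), ("Q5214166", 10)]:
    kb.add_entity(entity=qid, freq=freq, entity_vector=[0.0, 0.0, 0.0])

# Map the surface form "Paris" to all five entities with prior probabilities.
kb.add_alias(
    alias="Paris",
    entities=["Q90", "Q830149", "Q47899", "Q7137357", "Q5214166"],
    probabilities=[0.6, 0.05, 0.2, 0.1, 0.05],
)

# All candidates for the mention "Paris", regardless of NER label.
print([c.entity_ for c in kb.get_alias_candidates("Paris")])
```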
My question concerns the recognized entity label. If the NER recognizes "Paris" as PERSON, this excludes Q90 (Paris, capital of France) and Q830149 (Paris, county seat of Lamar County, Texas, United States) from the candidates, leaving 3 candidates. If "Paris" were recognized as LOCATION instead, only the other 2 candidates would remain.
Is it possible to somehow tell the KB or the EL model which set of entities to choose candidates from, given the detected NER label? Can this be done before or after training the EL model?
This is currently not implemented in spaCy. Generally speaking, these would be the steps needed to get to the functionality you want:
1. Store the entity type for each entry in the KB, e.g. Q90 is-a "LOCATION", Q830149 is-a "LOCATION", etc.
2. Extend the candidate generation (currently the get_candidates method) to take a textual mention + its NER label, and only output relevant candidates for that specific label (a sketch of such a generator follows below).

One caveat I'd like to point out is that this approach may amplify errors from the NER step. Imagine that you're talking about Paris, the capital, but your NER gets it wrong and tags it as a "PERSON". With the approach described here, the NEL won't be able to recover from that, and will output the most likely person it can find, though none of the candidates are correct.
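To make step 2 concrete, here is a hypothetical sketch of a label-aware candidate filter. The entity_labels table and the get_candidates_for_label helper are illustrations of the idea, not spaCy API: the KB does not store entity types, so you would maintain that mapping yourself.

```python
# Hypothetical side table mapping each KB entity to its NER label;
# spaCy's KB has no such field, so we keep it ourselves.
entity_labels = {
    "Q90": "LOCATION",      # Paris, capital of France
    "Q830149": "LOCATION",  # Paris, Texas
    "Q47899": "PERSON",     # Paris Hilton
    "Q7137357": "PERSON",   # Paris Themmen
    "Q5214166": "PERSON",   # Dan Paris
}

def get_candidates_for_label(kb, mention, ner_label):
    """Keep only candidates whose stored type matches the NER label.

    `kb` is an InMemoryLookupKB/KnowledgeBase like the one built in the
    question's sketch above.
    """
    return [
        candidate for candidate in kb.get_alias_candidates(mention)
        if entity_labels.get(candidate.entity_) == ner_label
    ]

# "Paris" tagged PERSON   -> Q47899, Q7137357, Q5214166 remain;
# "Paris" tagged LOCATION -> only Q90 and Q830149 remain.
```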
Another approach would be to leave the candidate generator unchanged, but take the NER label into account in the scoring mechanism of the entity_linker pipe. Currently, it already combines two scores: one from the prior probability (using statistics from a large training corpus), and one from the context (using ML and sentence similarity). Whether the NER label matches the entity's type could be included as a third component of that score; then there would still be a chance of resolving "Paris" to the correct entity, even when its NER label is wrong. But it depends on how strictly you'd want to enforce the label constraint.
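As a rough illustration of that soft constraint, here is a sketch of such a combined score. The weights and the label term are assumptions made for illustration, not how spaCy's entity_linker actually combines its scores.

```python
# Hypothetical combined score with a soft NER-label bonus; the weights
# and the label term are assumptions, not spaCy's entity_linker internals.
def combined_score(prior_prob, context_sim, label_matches,
                   w_prior=0.4, w_context=0.4, w_label=0.2):
    # prior_prob:    alias -> entity prior from the KB (corpus statistics)
    # context_sim:   similarity between sentence encoding and entity vector
    # label_matches: does the NER label agree with the entity's stored type?
    label_term = 1.0 if label_matches else 0.0
    return w_prior * prior_prob + w_context * context_sim + w_label * label_term

# A mis-tagged "Paris" (PERSON) can still resolve to Q90 (LOCATION) when
# the context score is strong enough to outweigh the missing label bonus.
```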