I want to use spaCy for Entity Linking (EL). I already trained a spaCy Named Entity Recognition (NER) model with custom labels on my domain-specific corpus. However, the following example uses the standard entity labels PERSON and LOCATION.
After aliases have been set in the Knowledge Base (KB), the KB returns candidates for occurrences of recognized entities. For example, candidates for "Paris" can be the Wikidata entries Q47899 (Paris Hilton), Q7137357 (Paris Themmen), Q5214166 (Dan Paris), Q90 (Paris, capital of France), or Q830149 (Paris, county seat of Lamar County, Texas, United States).
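For reference, here is a minimal sketch of how such a KB can be set up, assuming spaCy >= 3.5 (InMemoryLookupKB; in earlier 3.x versions the class is KnowledgeBase) and made-up frequencies, entity vectors, and prior probabilities:

```python
import spacy
from spacy.kb import InMemoryLookupKB  # spaCy >= 3.5; earlier 3.x: KnowledgeBase

nlp = spacy.blank("en")
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)

# Register the entities with dummy frequencies and entity vectors.
for qid, freq in [("Q90", 500), ("Q830149", 20), ("Q47899", 100),
                  ("Q7137357", 10), ("Q5214166", 10)]:
    kb.add_entity(entity=qid, freq=freq, entity_vector=[0.0, 0.0, 0.0])

# Map the surface form "Paris" to all five entities with prior probabilities.
kb.add_alias(
    alias="Paris",
    entities=["Q90", "Q830149", "Q47899", "Q7137357", "Q5214166"],
    probabilities=[0.6, 0.05, 0.2, 0.1, 0.05],
)

# All candidates for the mention "Paris", regardless of NER label.
print([c.entity_ for c in kb.get_alias_candidates("Paris")])
```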
My question concerns the recognized entity label. If the NER recognizes "Paris" as PERSON, this excludes Q90 (Paris, capital of France) and Q830149 (Paris, county seat of Lamar County, Texas, United States) from the candidates, leaving 3 candidates. If "Paris" were recognized as LOCATION instead, only the other 2 candidates would remain.
Is it possible to somehow tell the KB or the EL model which set of entities to choose candidates from, given the detected NER label? Can this be done before or after training the EL model?
This is currently not implemented in spaCy. Generally speaking, these would be the steps needed to get to the functionality you want:
1. Store the entity type for each entry in the KB, e.g. Q90 is-a "LOCATION", Q830149 is-a "LOCATION", etc.
2. Extend the candidate generation (currently the get_candidates method) to take a textual mention + its NER label, and only output relevant candidates for that specific label (a sketch of such a generator follows below).

One caveat I'd like to point out is that this approach may amplify errors from the NER step. Imagine that you're talking about Paris, the capital, but your NER gets it wrong and tags it as a "PERSON". With the approach described here, the NEL won't be able to recover from that, and will output the most likely person it can find, though none of the candidates are correct.
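To make step 2 concrete, here is a hypothetical sketch of a label-aware candidate filter. The entity_labels table and the get_candidates_for_label helper are illustrations of the idea, not spaCy API: the KB does not store entity types, so you would maintain that mapping yourself.

```python
# Hypothetical side table mapping each KB entity to its NER label;
# spaCy's KB has no such field, so we keep it ourselves.
entity_labels = {
    "Q90": "LOCATION",      # Paris, capital of France
    "Q830149": "LOCATION",  # Paris, Texas
    "Q47899": "PERSON",     # Paris Hilton
    "Q7137357": "PERSON",   # Paris Themmen
    "Q5214166": "PERSON",   # Dan Paris
}

def get_candidates_for_label(kb, mention, ner_label):
    """Keep only candidates whose stored type matches the NER label.

    `kb` is an InMemoryLookupKB/KnowledgeBase like the one built in the
    question's sketch above.
    """
    return [
        candidate for candidate in kb.get_alias_candidates(mention)
        if entity_labels.get(candidate.entity_) == ner_label
    ]

# "Paris" tagged PERSON   -> Q47899, Q7137357, Q5214166 remain;
# "Paris" tagged LOCATION -> only Q90 and Q830149 remain.
```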
Another approach would be to leave the candidate generator unchanged, but take the NER label into account in the scoring mechanism of the entity_linker pipe. Currently, it already combines two scores: one from the prior probability (using statistics from a large training corpus), and one from the context (using ML and sentence similarity). Whether the NER label matches the entity's type could be included as a third component of that score; then there would still be a chance of resolving "Paris" to the correct entity, even when its NER label is wrong. But it depends on how strictly you'd want to enforce the label constraint.
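As a rough illustration of that soft constraint, here is a sketch of such a combined score. The weights and the label term are assumptions made for illustration, not how spaCy's entity_linker actually combines its scores.

```python
# Hypothetical combined score with a soft NER-label bonus; the weights
# and the label term are assumptions, not spaCy's entity_linker internals.
def combined_score(prior_prob, context_sim, label_matches,
                   w_prior=0.4, w_context=0.4, w_label=0.2):
    # prior_prob:    alias -> entity prior from the KB (corpus statistics)
    # context_sim:   similarity between sentence encoding and entity vector
    # label_matches: does the NER label agree with the entity's stored type?
    label_term = 1.0 if label_matches else 0.0
    return w_prior * prior_prob + w_context * context_sim + w_label * label_term

# A mis-tagged "Paris" (PERSON) can still resolve to Q90 (LOCATION) when
# the context score is strong enough to outweigh the missing label bonus.
```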