
spaCy coreference resolution - named entity recognition (NER) to return unique entity IDs?

Perhaps I've skipped over a part of the docs, but what I am trying to determine is a unique ID for each entity in the standard NER toolset. For example:

import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

text = "This is a text about Apple Inc based in San Fransisco. "\
        "And here is some text about Samsung Corp. "\
        "Now, here is some more text about Apple and its products for customers in Norway"

doc = nlp(text)

for ent in doc.ents:
    print('ID:{}\t{}\t"{}"\t'.format(ent.label,ent.label_,ent.text,))


displacy.render(doc, jupyter=True, style='ent')

returns:

ID:381    ORG "Apple Inc" 
ID:382    GPE "San Fransisco" 
ID:381    ORG "Samsung Corp." 
ID:381    ORG "Apple" 
ID:382    GPE "Norway"

I have been looking at ent.ent_id and ent.ent_id_ but these are inactive according to the docs. I couldn't find anything in ent.root either.

For example, in GCP NLP each entity is returned with an ⟨entity⟩number that enables you to identify multiple instances of the same entity within a text.

This is a ⟨text⟩2 about ⟨Apple Inc⟩1 based in ⟨San Fransisco⟩4. And here is some ⟨text⟩3 about ⟨Samsung Corp⟩6. Now, here is some more ⟨text⟩8 about ⟨Apple⟩1 and its ⟨products⟩5 for ⟨customers⟩7 in ⟨Norway⟩9

Does spaCy support something similar? Or is there a way using NLTK or Stanford?

asked Dec 12 '18 at 19:12 by BenP



1 Answer

You can use the neuralcoref library to add coreference resolution to a spaCy pipeline. It groups the mentions that refer to the same entity into clusters:

# Load your usual spaCy model (one of the spaCy English models)
import spacy
nlp = spacy.load('en')

# Add NeuralCoref to spaCy's pipeline
import neuralcoref
neuralcoref.add_to_pipe(nlp)

# That's it. The coreference annotations are now available as extension attributes on the Doc.
doc = nlp(u'My sister has a dog. She loves him.')

print(doc._.has_coref)       # whether any coreference was resolved in the doc
print(doc._.coref_clusters)  # clusters of mentions that refer to the same entity

Find the installation and usage instructions here: https://github.com/huggingface/neuralcoref
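
If all you need is a number shared by repeated mentions such as "Apple Inc" and "Apple", a much cruder alternative to coreference resolution is to group mentions by a normalized form of their text. The following is only a sketch of that idea, not something spaCy or neuralcoref provides: the names `normalize`, `assign_entity_ids` and the suffix list are hypothetical, and the mentions are hard-coded from the output in the question so that the block runs without any model.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "ltd", "llc", "co"}


def normalize(text: str) -> str:
    """Lowercase the mention, drop punctuation and trailing corporate suffixes."""
    tokens = re.findall(r"\w+", text.lower())
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def assign_entity_ids(mentions: List[Tuple[str, str]]) -> List[Tuple[int, str, str]]:
    """Return (id, label, text) per mention; equal (normalized text, label) pairs share an id."""
    ids: Dict[Tuple[str, str], int] = {}
    result = []
    for text, label in mentions:
        key = (normalize(text), label)
        # First time a key is seen it gets the next free id, starting at 1.
        entity_id = ids.setdefault(key, len(ids) + 1)
        result.append((entity_id, label, text))
    return result


# (text, label) pairs copied from the question's output.
mentions = [
    ("Apple Inc", "ORG"),
    ("San Fransisco", "GPE"),
    ("Samsung Corp.", "ORG"),
    ("Apple", "ORG"),
    ("Norway", "GPE"),
]

for entity_id, label, text in assign_entity_ids(mentions):
    print('ID:{}\t{}\t"{}"'.format(entity_id, label, text))
```

With a real document you would pass `[(ent.text, ent.label_) for ent in doc.ents]` instead of the hard-coded list. This only catches mentions whose surface strings match after normalization; it will not link pronouns or aliases, which is what the coreference clusters above are for.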

answered Sep 28 '22 at 17:09 by scorp