How to get spaCy NER probability

I want to combine spaCy's NER engine with a separate NER engine (a BoW model). I'm currently comparing outputs from the two engines, trying to figure out what the optimal combination of the two would be. Both perform decently, but quite often spaCy finds entities that the BoW engine misses, and vice versa.

What I would like is to access a probability score (or something similar) from spaCy whenever it finds an entity that is not found by the BoW engine. Can I get spaCy to print out its own probability score for a given entity it has found? As in, "Hi, I'm spaCy. I've found this token (or combination of tokens) that I'm X% certain is an entity of type BLAH." I want to know that number X every time spaCy finds an entity.

I imagine there must be such a number somewhere internally in spaCy's NER engine, plus a threshold value below which the possible entity is not flagged as an entity, and I'd like to know how to get my hands on that number. Thanks in advance.
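
To make the intent concrete, here is a rough sketch of the kind of merge I have in mind, assuming I could get such a score out of spaCy (all names and the 0.5 cutoff below are placeholders I made up):

# Hypothetical combination of the two engines, assuming spaCy reported
# a per-entity confidence score. Names and the cutoff are placeholders.

# Entities found by the BoW engine, as (text, label) pairs
bow_ents = {('Japan', 'GPE'), ('European Union', 'ORG')}

# Entities found by spaCy, mapped to the confidence score I am after
spacy_scored_ents = {('Japan', 'GPE'): 0.99, ('America', 'GPE'): 0.71}

combined = set(bow_ents)
for ent, score in spacy_scored_ents.items():
    # Only accept spaCy-only entities when spaCy is confident enough
    if ent not in bow_ents and score > 0.5:
        combined.add(ent)

print(combined)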

Mede asked Oct 25 '17


1 Answer

Actually, there is a GitHub issue in the spaCy repository about exactly that.

The author of the library suggests there (among others) the following solution:

  1. Beam search with global objective. This is the standard solution: use a global objective, so that the parser model is trained to prefer parses that are better overall. Keep N different candidates, and output the best one. This can be used to support confidence by looking at the alternate analyses in the beam. If an entity occurs in every analysis, the NER is more confident it's correct.

Code:

import spacy
from collections import defaultdict

nlp = spacy.load('en')  # spaCy 2.x shortcut; newer releases use e.g. 'en_core_web_sm'
text = (u'Will Japan join the European Union? If yes, we should '
        u'move to United States. Fasten your belts, America we are coming')

# Run the pipeline without the regular NER, so the beam parser can be
# applied to the otherwise processed doc
with nlp.disable_pipes('ner'):
    doc = nlp(text)

threshold = 0.2
# nlp.entity is the spaCy 2.x shortcut for the 'ner' pipe. Note that in some
# spaCy 2.x versions beam_parse returns only the beams, so the unpacking
# below may need adjusting.
(beams, somethingelse) = nlp.entity.beam_parse([doc], beam_width=16, beam_density=0.0001)

# Sum the scores of all candidate parses in which each entity span appears
entity_scores = defaultdict(float)
for beam in beams:
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score

print('Entities and scores (detected with beam search)')
for (start, end, label), score in entity_scores.items():
    if score > threshold:
        print('Label: {}, Text: {}, Score: {}'.format(label, doc[start:end], score))

Sample output:

Entities and scores (detected with beam search)

Label: GPE, Text: Japan, Score: 0.9999999999999997

Label: GPE, Text: America, Score: 0.9991664575947963

Important note: the outputs you get here may differ from the outputs of the standard NER, since beam search is an alternative decoding strategy. However, the beam search approach gives you a confidence metric, which, as I understand from your question, is exactly what you need.

Outputs with Standard NER for this example:

Label: GPE, Text: Japan

Label: ORG, Text: the European Union

Label: GPE, Text: United States

Label: GPE, Text: America
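
For reference, the standard NER output above comes straight from doc.ents on a doc processed with the full pipeline (NER enabled); a minimal sketch, assuming the same model as in the code above:

import spacy

nlp = spacy.load('en')
text = (u'Will Japan join the European Union? If yes, we should '
        u'move to United States. Fasten your belts, America we are coming')

# Run the full pipeline, including the regular (greedy) NER
doc = nlp(text)

for ent in doc.ents:
    # Standard NER gives spans and labels, but no confidence score
    print('Label: {}, Text: {}'.format(ent.label_, ent.text))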

gdaras answered Oct 12 '22