Computing precision and recall in Named Entity Recognition

I am about to report results from Named Entity Recognition. One thing I find a bit confusing is that my understanding of precision and recall was that one simply sums up the true positives, true negatives, false positives and false negatives over all classes.

But this seems implausible now that I think about it, as each misclassification would simultaneously give rise to one false positive and one false negative (e.g. a token that should have been labelled as "A" but was labelled as "B" is a false negative for "A" and a false positive for "B"). Thus the number of false positives and false negatives over all classes would be the same, which means that precision would (always!) equal recall. This simply can't be true, so there must be an error in my reasoning, and I wonder where it is. It is certainly something quite obvious and straightforward, but it escapes me right now.

Asked by Nick, Nov 23 '09


2 Answers

The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):

[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today

This has 3 entities.

Suppose your actual extraction has the following:

[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.

We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.

Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7)

Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33
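
To make the exact-match bookkeeping concrete, here is a minimal Python sketch; the entity strings stand in for annotated spans, and the variable names are purely illustrative rather than part of any standard evaluation script:

    # Gold and predicted entities from the example above, represented by their surface strings.
    gold = {"Microsoft Corp.", "Steve Ballmer", "Windows 7"}
    pred = {"Microsoft Corp.", "CEO", "Steve", "today"}

    tp = len(gold & pred)        # exact matches only -> 1 (Microsoft Corp.)
    fp = len(pred - gold)        # predicted but not in gold -> 3 (CEO, Steve, today)
    fn = len(gold - pred)        # in gold but never predicted exactly -> 2 (Steve Ballmer, Windows 7)

    precision = tp / (tp + fp)   # 1 / 4 = 0.25
    recall = tp / (tp + fn)      # 1 / 3 ≈ 0.33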

Any Overlap OK: True Positives = 2 (Microsoft Corp., and Steve, which overlaps Steve Ballmer), False Positives = 2 (CEO and today), False Negatives = 1 (Windows 7)

Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.50
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.66
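
The overlap criterion needs span offsets rather than just strings. Here is a rough sketch, assuming half-open character spans and ignoring the extra bookkeeping needed when one prediction overlaps several gold entities:

    text = "Microsoft Corp. CEO Steve Ballmer announced the release of Windows 7 today"

    def span(s):
        # Locate an entity string in the sentence and return its (start, end) offsets.
        start = text.index(s)
        return (start, start + len(s))

    gold = [span(s) for s in ("Microsoft Corp.", "Steve Ballmer", "Windows 7")]
    pred = [span(s) for s in ("Microsoft Corp.", "CEO", "Steve", "today")]

    def overlaps(a, b):
        # Half-open spans overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    tp = sum(any(overlaps(p, g) for g in gold) for p in pred)      # Microsoft Corp., Steve -> 2
    fp = len(pred) - tp                                            # CEO, today -> 2
    fn = sum(not any(overlaps(g, p) for p in pred) for g in gold)  # Windows 7 -> 1

    precision = tp / (tp + fp)   # 2 / 4 = 0.50
    recall = tp / (tp + fn)      # 2 / 3 ≈ 0.66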

The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.

It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
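
For example, with the figures above, the harmonic mean works out as follows:

    # F1 for the exact-match figures (harmonic mean of precision and recall).
    precision, recall = 0.25, 1 / 3
    f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.29
    # For the any-overlap figures: 2 * 0.50 * (2/3) / (0.50 + 2/3) ≈ 0.57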

Answered by Ken Bloom

In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach in evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.
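
To illustrate entity-level scoring of that kind, here is a simplified sketch that converts BIO tag sequences into (type, start, end) entities and then compares them; it assumes well-formed BIO tags and is not the official conlleval script:

    def bio_to_entities(tags):
        # Collect (type, start, end) entities from a BIO tag sequence (end is exclusive).
        entities, start, etype = set(), None, None
        for i, tag in enumerate(tags + ["O"]):        # trailing "O" flushes the last entity
            if tag == "O" or tag.startswith("B-"):
                if etype is not None:
                    entities.add((etype, start, i))
                    start, etype = None, None
                if tag.startswith("B-"):
                    start, etype = i, tag[2:]
            # "I-" tags simply extend the current entity
        return entities

    gold = bio_to_entities(["B-ORG", "I-ORG", "O", "B-PER", "I-PER"])
    pred = bio_to_entities(["B-ORG", "I-ORG", "O", "B-PER", "O"])

    tp = len(gold & pred)       # an entity counts only if type, start and end all match -> 1
    precision = tp / len(pred)  # 1 / 2
    recall = tp / len(gold)     # 1 / 2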

However, the problem you described still exists. If you mark an entity of type ORG as type LOC, you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion of the problem in this blog post.
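
A toy illustration of that effect, with made-up spans and per-type counts:

    from collections import Counter

    # One gold ORG entity predicted with the right boundaries but the wrong type (LOC).
    gold = {("ORG", 0, 2)}
    pred = {("LOC", 0, 2)}

    fp_by_type = Counter(t for t, *_ in pred - gold)   # Counter({'LOC': 1})
    fn_by_type = Counter(t for t, *_ in gold - pred)   # Counter({'ORG': 1})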

Answered by Stompchicken