I am about to report results from a Named Entity Recognition experiment. One thing I find a bit confusing is that my understanding of precision and recall was that one simply sums up true positives, true negatives, false positives, and false negatives over all classes.
But this seems implausible now that I think about it: each misclassification would simultaneously give rise to one false positive and one false negative (e.g. a token that should have been labelled "A" but was labelled "B" is a false negative for "A" and a false positive for "B"). Thus the total number of false positives and false negatives over all classes would be the same, which would mean that precision is (always!) equal to recall. This simply can't be true, so there must be an error in my reasoning, and I wonder where it is. It is certainly something quite obvious and straightforward, but it escapes me right now.
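To make my reasoning concrete, here is a minimal sketch (plain Python, toy labels of my own) of what I mean: if every token receives exactly one gold label and exactly one predicted label, and every class is counted in the sums, the false-positive and false-negative totals are forced to be equal.

```python
from collections import Counter

# Toy token-level labels (made up for illustration): every token has exactly
# one gold label and one predicted label, and every class is counted,
# including the non-entity "O" class.
gold = ["PER", "PER", "ORG", "O",   "LOC", "O"]
pred = ["PER", "ORG", "ORG", "LOC", "O",   "O"]

tp, fp, fn = Counter(), Counter(), Counter()
for g, p in zip(gold, pred):
    if g == p:
        tp[g] += 1
    else:
        fp[p] += 1   # one false positive for the predicted class...
        fn[g] += 1   # ...and one false negative for the gold class

# Summed over *all* classes the totals must be equal,
# so micro-averaged precision equals micro-averaged recall.
total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
print(total_fp == total_fn)                 # True, by construction
print(total_tp / (total_tp + total_fp))     # micro precision = 0.5
print(total_tp / (total_tp + total_fn))     # micro recall    = 0.5
```

If the "O"/non-entity class is dropped from the sums, the symmetry already seems to break: a real token wrongly labelled "O" only adds a false negative, and an "O" token wrongly labelled as an entity only adds a false positive.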
“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”
For evaluation, custom NER uses, among other metrics, precision: it measures how precise/accurate your model is and is the ratio between the correctly identified positives (true positives) and all identified positives. The precision metric reveals how many of the predicted entities are correctly labeled.
The three major approaches to NER are lexicon-based, rule-based, and machine learning. Lexicon-based approaches use a lexicon or gazetteer constructed from external knowledge sources to match chunks of the text with entity names. Rule-based systems construct rules manually or automatically and use them for entity detection.
A trained NER model does not learn to label entities only from the pre-labelled training data; it also learns to find and recognise entities from the surrounding context.
The way precision and recall are typically computed (this is what I use in my papers) is to measure entities against each other. Suppose the ground truth has the following (without any differentiation as to what type of entities they are):
[Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today
This has 3 entities.
Suppose your actual extraction has the following:
[Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]
You have an exact match for Microsoft Corp., false positives for CEO and today, a false negative for Windows 7, and a substring match for Steve.
We compute precision and recall by first defining matching criteria. For example, do they have to be an exact match? Is it a match if they overlap at all? Do entity types matter? Typically we want to provide precision and recall for several of these criteria.
Exact match: True Positives = 1 (Microsoft Corp., the only exact match), False Positives = 3 (CEO, today, and Steve, which isn't an exact match), False Negatives = 2 (Steve Ballmer and Windows 7).
Precision = True Positives / (True Positives + False Positives) = 1/(1+3) = 0.25
Recall = True Positives / (True Positives + False Negatives) = 1/(1+2) = 0.33
Any Overlap OK: True Positives = 2 (Microsoft Corp. and Steve, which overlaps Steve Ballmer), False Positives = 2 (CEO and today), False Negatives = 1 (Windows 7).
Precision = True Positives / (True Positives + False Positives) = 2/(2+2) = 0.50
Recall = True Positives / (True Positives + False Negatives) = 2/(2+1) = 0.66
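To make the counting concrete, here is a small, self-contained sketch (my own illustration, ignoring entity types, with token offsets I chose for this sentence) that reproduces the counts for both criteria:

```python
# Gold and predicted entity spans as half-open (start, end) token offsets in:
# "Microsoft Corp. CEO Steve Ballmer announced the release of Windows 7 today"
#  tokens: 0=Microsoft 1=Corp. 2=CEO 3=Steve 4=Ballmer ... 9=Windows 10=7 11=today
gold = {(0, 2), (3, 5), (9, 11)}           # Microsoft Corp. | Steve Ballmer | Windows 7
pred = {(0, 2), (2, 3), (3, 4), (11, 12)}  # Microsoft Corp. | CEO | Steve | today

def overlaps(a, b):
    """True if the two half-open token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def score(gold, pred, match):
    # Precision counts matched predictions; recall counts matched gold entities.
    tp = sum(1 for p in pred if any(match(p, g) for g in gold))    # matched predictions
    fp = len(pred) - tp                                            # spurious predictions
    matched_gold = sum(1 for g in gold if any(match(p, g) for p in pred))
    fn = len(gold) - matched_gold                                  # missed gold entities
    return tp, fp, fn, tp / (tp + fp), matched_gold / len(gold)

print(score(gold, pred, lambda p, g: p == g))  # exact match: TP=1, FP=3, FN=2, P=0.25, R≈0.33
print(score(gold, pred, overlaps))             # any overlap: TP=2, FP=2, FN=1, P=0.50, R≈0.67
```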
The reader is then left to infer that the "real performance" (the precision and recall that an unbiased human checker would give when allowed to use human judgement to decide which overlap discrepancies are significant, and which are not) is somewhere between the two.
It's also often useful to report the F1 measure, which is the harmonic mean of precision and recall, and which gives some idea of "performance" when you have to trade off precision against recall.
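For the worked example above that comes out as follows (my arithmetic):

```python
# Harmonic mean of precision and recall, applied to the two criteria above.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.25, 1 / 3))   # exact match  -> ~0.286
print(f1(0.50, 2 / 3))   # any overlap  -> ~0.571
```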
In the CoNLL-2003 NER task, the evaluation was based on correctly marked entities, not tokens, as described in the paper 'Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition'. An entity is correctly marked if the system identifies an entity of the correct type with the correct start and end point in the document. I prefer this approach in evaluation because it's closer to a measure of performance on the actual task; a user of the NER system cares about entities, not individual tokens.
However, the problem you described still exists. If you mark an entity of type ORG as type LOC, you incur a false positive for LOC and a false negative for ORG. There is an interesting discussion of the problem in this blog post.
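Here is a minimal sketch of that kind of typed, entity-level scoring (my own toy code, not the official conlleval script, assuming well-formed BIO tags): a gold ORG span predicted as LOC with the correct boundaries shows up as a false positive for LOC and a false negative for ORG.

```python
from collections import Counter

def spans(tags):
    """Extract (type, start, end) entity spans from a well-formed BIO tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                out.append((etype, start, i))
                etype = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return set(out)

gold = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER"]  # e.g. "Acme Corp ... John Smith"
pred = ["B-LOC", "I-LOC", "O", "B-PER", "I-PER"]  # the ORG span mislabelled as LOC

gold_spans, pred_spans = spans(gold), spans(pred)
tp, fp, fn = Counter(), Counter(), Counter()
for etype, s, e in pred_spans:
    if (etype, s, e) in gold_spans:
        tp[etype] += 1
    else:
        fp[etype] += 1
for etype, s, e in gold_spans:
    if (etype, s, e) not in pred_spans:
        fn[etype] += 1

print(dict(tp))  # {'PER': 1}
print(dict(fp))  # {'LOC': 1}  <- false positive for LOC
print(dict(fn))  # {'ORG': 1}  <- false negative for ORG
```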