I am trying to evaluate a trained NER model created with the spaCy library. Normally for this kind of problem you would use the F1 score (the harmonic mean of precision and recall), but I could not find an evaluation function for a trained NER model in the documentation.
I am not sure if it is correct, but I am trying to do it the following way (example), using f1_score from sklearn:

    from sklearn.metrics import f1_score
    import spacy
    from spacy.gold import GoldParse

    nlp = spacy.load("en")  # load the pretrained model with its NER component
    test_text = "my name is John"  # text to test accuracy on
    doc_to_test = nlp(test_text)  # run the model to get a spaCy Doc with predicted entities

    # create a gold doc where we know the tagged entity for the text to be tested
    doc_gold_text = nlp.make_doc(test_text)
    entity_offsets_of_gold_text = [(11, 15, "PERSON")]
    gold = GoldParse(doc_gold_text, entities=entity_offsets_of_gold_text)

    # bring the data into a format acceptable for sklearn's f1 function
    y_true = ["PERSON" if "PERSON" in x else 'O' for x in gold.ner]
    y_predicted = [x.ent_type_ if x.ent_type_ != '' else 'O' for x in doc_to_test]

    f1_score(y_true, y_predicted, average='macro')
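As a quick sanity check, something like the following (just a rough sketch using the variables defined above) prints the gold and predicted labels token by token, so you can confirm that the two sequences are aligned before computing F1:

    # Compare gold vs. predicted label per token
    for token, true_label, pred_label in zip(doc_to_test, y_true, y_predicted):
        print(f"{token.text:10} gold={true_label:8} pred={pred_label}")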
Any thoughts or insights would be useful.
You can find different metrics including F-score, recall and precision in spaCy/scorer.py.
This example shows how you can use it:
    import spacy
    from spacy.gold import GoldParse
    from spacy.scorer import Scorer

    def evaluate(ner_model, examples):
        scorer = Scorer()
        for input_, annot in examples:
            doc_gold_text = ner_model.make_doc(input_)
            gold = GoldParse(doc_gold_text, entities=annot)
            pred_value = ner_model(input_)
            scorer.score(pred_value, gold)
        return scorer.scores

    # example run
    examples = [
        ('Who is Shaka Khan?', [(7, 17, 'PERSON')]),
        ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')]),
    ]

    ner_model = spacy.load(ner_model_path)  # for spaCy's pretrained model use 'en_core_web_sm'
    results = evaluate(ner_model, examples)
scorer.scores returns multiple scores. When running the example, the result looks like this. (Note that the entity scores are low because the examples label London and Berlin as 'LOC' while the model predicts 'GPE'; you can see this by looking at ents_per_type.)
    {'uas': 0.0, 'las': 0.0,
     'las_per_type': {'attr': {'p': 0.0, 'r': 0.0, 'f': 0.0},
                      'root': {'p': 0.0, 'r': 0.0, 'f': 0.0},
                      'compound': {'p': 0.0, 'r': 0.0, 'f': 0.0},
                      'nsubj': {'p': 0.0, 'r': 0.0, 'f': 0.0},
                      'dobj': {'p': 0.0, 'r': 0.0, 'f': 0.0},
                      'cc': {'p': 0.0, 'r': 0.0, 'f': 0.0},
                      'conj': {'p': 0.0, 'r': 0.0, 'f': 0.0}},
     'ents_p': 33.33333333333333,
     'ents_r': 33.33333333333333,
     'ents_f': 33.33333333333333,
     'ents_per_type': {'PERSON': {'p': 100.0, 'r': 100.0, 'f': 100.0},
                       'LOC': {'p': 0.0, 'r': 0.0, 'f': 0.0},
                       'GPE': {'p': 0.0, 'r': 0.0, 'f': 0.0}},
     'tags_acc': 0.0,
     'token_acc': 100.0,
     'textcat_score': 0.0,
     'textcats_per_cat': {}}
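To pull individual numbers out of that dictionary, you can index it directly; a minimal sketch using the field names shown in the output above:

    # overall NER precision / recall / F-score
    print(results['ents_p'], results['ents_r'], results['ents_f'])

    # per-label breakdown, e.g. to see why LOC scores 0 while GPE shows up instead
    for label, prf in results['ents_per_type'].items():
        print(label, prf['p'], prf['r'], prf['f'])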
The example is taken from a spaCy example on GitHub (the link no longer works). It was last tested with spaCy 2.2.4.
Note that in spaCy v3 there is an evaluate command you can run from the command line instead of writing custom evaluation code.
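For example, from the shell you can run something like python -m spacy evaluate ./my_trained_model ./dev.spacy --output metrics.json, where the paths are placeholders and the dev data is in spaCy's binary .spacy format. The same evaluation can also be done from Python in v3; a minimal sketch, assuming a trained pipeline saved at a placeholder path:

    import spacy
    from spacy.training import Example

    nlp = spacy.load("./my_trained_model")  # placeholder path to your trained pipeline

    # build Example objects pairing the raw text with the gold entity spans
    examples = [
        Example.from_dict(nlp.make_doc("Who is Shaka Khan?"),
                          {"entities": [(7, 17, "PERSON")]}),
        Example.from_dict(nlp.make_doc("I like London and Berlin."),
                          {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
    ]

    # returns a dict with ents_p, ents_r, ents_f, ents_per_type, ...
    scores = nlp.evaluate(examples)
    print(scores["ents_f"])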