I'm looking at the documentation for the Hugging Face pipeline for Named Entity Recognition, and it's not clear to me how these results are meant to be used in an actual entity recognition model.
For instance, given the example in the documentation:
>>> from transformers import pipeline
>>> nlp = pipeline("ner")
>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
... "close to the Manhattan Bridge which is visible from the window."
This outputs a list of all words that have been identified as an entity from the 9 classes defined above. Here are the expected results:
print(nlp(sequence))
[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
While this alone is impressive, it isn't clear to me the correct way to get "DUMBO" from:
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
or even for the cleaner multi-token matches, such as distinguishing "New York City" from simply the city of "York".
While I can imagine heuristic methods, what is the intended way to join these tokens back into correct labels for the given input?
NER with spaCy: spaCy is an open-source library for advanced Natural Language Processing in Python and Cython. It is well maintained and has over 20K stars on GitHub. It ships several pre-trained models that you can use directly on your data for tasks such as NER and information extraction.
Named Entity Recognition is a process where an algorithm takes a string of text (sentence or paragraph) as input and identifies relevant nouns (people, places, and organizations) that are mentioned in that string.
There are three major approaches to NER: lexicon-based, rule-based, and machine learning based.
In BERT, the id 101 is reserved for the special [CLS] token, the id 102 for the special [SEP] token, and the id 0 for the [PAD] token. token_type_ids identifies which sequence a token belongs to. Since we only have one sequence per text, all the values of token_type_ids will be 0.
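To make the role of these ids concrete, here is a minimal pure-Python sketch of how a single sequence could be laid out. It does not call the real tokenizer: the function name, the WordPiece ids 7001 to 7003 and the attention_mask field are assumptions added for illustration, and only the three special ids come from the text above.

```python
# Special-token ids quoted in the text above.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_single_sequence_inputs(wordpiece_ids, max_length):
    """Wrap one sequence in [CLS] ... [SEP] and pad it to max_length."""
    input_ids = [CLS_ID] + list(wordpiece_ids) + [SEP_ID]
    if len(input_ids) > max_length:
        raise ValueError("sequence does not fit into max_length")
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        # Only one sequence per text, so every position belongs to segment 0.
        "token_type_ids": [0] * max_length,
        # Assumed extra field: 1 marks real tokens, 0 marks padding.
        "attention_mask": [1] * len(input_ids) + [0] * padding,
    }


# Hypothetical WordPiece ids, chosen only for illustration.
example = build_single_sequence_inputs([7001, 7002, 7003], max_length=8)
print(example["input_ids"])  # [101, 7001, 7002, 7003, 102, 0, 0, 0]
```

In practice the tokenizer produces these fields for you; the sketch only shows where the reserved ids end up.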
The pipeline object can do that for you when you set the parameter grouped_entities to True (transformers < 4.7.0) or aggregation_strategy to 'simple' (transformers >= 4.7.0):
from transformers import pipeline
#transformers < 4.7.0
#ner = pipeline("ner", grouped_entities=True)
ner = pipeline("ner", aggregation_strategy='simple')
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
output = ner(sequence)
print(output)
Output:
[{'entity_group': 'I-ORG', 'score': 0.9970663785934448, 'word': 'Hugging Face Inc'}
, {'entity_group': 'I-LOC', 'score': 0.9993778467178345, 'word': 'New York City'}
, {'entity_group': 'I-LOC', 'score': 0.9571147759755453, 'word': 'DUMBO'}
, {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}
, {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}]
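For intuition only, the grouping can be approximated by hand. The sketch below is not the library's implementation: it assumes each token dict also carries an index field giving its position in the tokenized input (the output quoted in the question does not show one, so the index values here are illustrative), it ignores the B-/I- distinction, and it averages the token scores as a simple stand-in for whatever the pipeline computes.

```python
def group_entities(tokens):
    """Merge adjacent token-level predictions into entity spans."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-LOC" -> "LOC"
        piece = tok["word"]
        is_subword = piece.startswith("##")
        text = piece[2:] if is_subword else piece
        last = groups[-1] if groups else None
        # Tokens belong together when they are neighbours with the same label.
        if (last is not None
                and tok["index"] == last["last_index"] + 1
                and label == last["entity_group"]):
            # WordPiece continuations are glued on, whole words get a space.
            last["word"] += text if is_subword else " " + text
            last["scores"].append(tok["score"])
            last["last_index"] = tok["index"]
        else:
            groups.append({"entity_group": label, "word": text,
                           "scores": [tok["score"]],
                           "last_index": tok["index"]})
    return [{"entity_group": g["entity_group"],
             "word": g["word"],
             "score": sum(g["scores"]) / len(g["scores"])}
            for g in groups]


# Words, labels and scores are from the question; index values are assumed.
tokens = [
    {"word": "New", "score": 0.9994346499443054, "entity": "I-LOC", "index": 11},
    {"word": "York", "score": 0.9993270635604858, "entity": "I-LOC", "index": 12},
    {"word": "City", "score": 0.9993864893913269, "entity": "I-LOC", "index": 13},
    {"word": "D", "score": 0.9825621843338013, "entity": "I-LOC", "index": 19},
    {"word": "##UM", "score": 0.936983048915863, "entity": "I-LOC", "index": 20},
    {"word": "##BO", "score": 0.8987102508544922, "entity": "I-LOC", "index": 21},
]
grouped = group_entities(tokens)
print([g["word"] for g in grouped])  # ['New York City', 'DUMBO']
```

The gap in the index values is what keeps "City" and "D" apart even though they are neighbours in the entity list; without position information the two LOC spans could not be separated reliably.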
Quick update: grouped_entities has been deprecated.
UserWarning: grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="AggregationStrategy.SIMPLE" instead.
f'grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="{aggregation_strategy}" instead.'
You will have to change your code to:
ner = pipeline("ner", aggregation_strategy="simple")
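If the same code has to run against both older and newer installs, one option is to pick the keyword argument from the installed version. This is only a sketch: the helper name ner_aggregation_kwargs is made up for illustration, and the 4.7.0 cutoff is taken from the comment in the answer above rather than checked against the release notes.

```python
def ner_aggregation_kwargs(transformers_version):
    """Return the grouping keyword argument suited to a transformers version."""
    # Compare only major.minor, so strings like "4.7.0.dev0" also work.
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    if (major, minor) < (4, 7):
        return {"grouped_entities": True}
    return {"aggregation_strategy": "simple"}


print(ner_aggregation_kwargs("4.6.1"))   # {'grouped_entities': True}
print(ner_aggregation_kwargs("4.30.2"))  # {'aggregation_strategy': 'simple'}

# Intended use (not run here, since it downloads a model):
# import transformers
# ner = transformers.pipeline(
#     "ner", **ner_aggregation_kwargs(transformers.__version__))
```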