Named Entity Recognition with Huggingface transformers, mapping back to complete entities

I'm looking at the documentation for Huggingface pipeline for Named Entity Recognition, and it's not clear to me how these results are meant to be used in an actual entity recognition model.

For instance, given the example in documentation:

>>> from transformers import pipeline

>>> nlp = pipeline("ner")

>>> sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very"
...            "close to the Manhattan Bridge which is visible from the window."

This outputs a list of all words that have been identified as an entity from the 9 classes defined above. Here are the expected results:

print(nlp(sequence))

[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

While this alone is impressive, it isn't clear to me how to correctly recover "DUMBO" from:

{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},

The same goes for the cleaner multi-token matches, such as distinguishing "New York City" from simply the city of "York."

While I can imagine heuristic methods, what is the intended way to join these tokens back into complete, correctly labeled entities?

asked Aug 02 '20 23:08 by Mittenchops




2 Answers

The pipeline object can do that for you when you set the appropriate parameter:

  • transformers < 4.7.0: set grouped_entities to True.
  • transformers >= 4.7.0: set aggregation_strategy to "simple".

from transformers import pipeline

# transformers < 4.7.0:
# ner = pipeline("ner", grouped_entities=True)

# transformers >= 4.7.0:
ner = pipeline("ner", aggregation_strategy="simple")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."

output = ner(sequence)

print(output)

Output:

[{'entity_group': 'I-ORG', 'score': 0.9970663785934448, 'word': 'Hugging Face Inc'},
 {'entity_group': 'I-LOC', 'score': 0.9993778467178345, 'word': 'New York City'},
 {'entity_group': 'I-LOC', 'score': 0.9571147759755453, 'word': 'DUMBO'},
 {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}]
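
For intuition about what the grouping does, here is a minimal sketch of merging tokens by hand. It is not the library's implementation. It assumes each token dict also carries an 'index' field giving the token's position in the input (the raw output in the question does not show one, and the index values below are hand-written for illustration). The function name group_entities is hypothetical, and averaging the scores is just one plausible choice.

```python
def group_entities(tokens):
    """Merge adjacent tokens with the same label into entity spans.

    Assumes each token dict has 'word', 'entity', 'score' and 'index',
    where 'index' is the token's position in the tokenized input.
    """
    groups = []
    for tok in sorted(tokens, key=lambda t: t["index"]):
        label = tok["entity"].split("-")[-1]        # 'I-LOC' -> 'LOC'
        is_subword = tok["word"].startswith("##")   # WordPiece continuation
        starts_new = tok["entity"].startswith("B-") # explicit entity start
        last = groups[-1] if groups else None
        adjacent = last is not None and tok["index"] == last["last_index"] + 1

        if adjacent and (is_subword or (label == last["entity_group"] and not starts_new)):
            # Continuations are glued on directly; whole words get a space.
            last["word"] += tok["word"][2:] if is_subword else " " + tok["word"]
            last["scores"].append(tok["score"])
            last["last_index"] = tok["index"]
        else:
            groups.append({
                "entity_group": label,
                "word": tok["word"],
                "scores": [tok["score"]],
                "last_index": tok["index"],
            })

    # Report the mean token score per group (an assumption of this sketch).
    return [
        {
            "entity_group": g["entity_group"],
            "word": g["word"],
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]


# Hand-written sample: words and scores from the question, indices assumed.
sample = [
    {"word": "New", "score": 0.9994346499443054, "entity": "I-LOC", "index": 11},
    {"word": "York", "score": 0.9993270635604858, "entity": "I-LOC", "index": 12},
    {"word": "City", "score": 0.9993864893913269, "entity": "I-LOC", "index": 13},
    {"word": "D", "score": 0.9825621843338013, "entity": "I-LOC", "index": 19},
    {"word": "##UM", "score": 0.936983048915863, "entity": "I-LOC", "index": 20},
    {"word": "##BO", "score": 0.8987102508544922, "entity": "I-LOC", "index": 21},
]

print([g["word"] for g in group_entities(sample)])  # ['New York City', 'DUMBO']
```

The position check is what keeps "City" and "DUMBO" apart even though both are tagged I-LOC. In practice the pipeline's own grouping is preferable, since it may compute scores and handle edge cases differently from this sketch.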
answered Oct 08 '22 09:10 by cronoik


Quick update: grouped_entities has been deprecated.

UserWarning: grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="AggregationStrategy.SIMPLE" instead.
f'grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="{aggregation_strategy}" instead.'

You will have to change your code to:

ner = pipeline("ner", aggregation_strategy="simple")
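
If the same code has to run against both older and newer installs, one option is to pick the keyword argument from the installed version. The sketch below is only an illustration: the helper name ner_pipeline_kwargs is hypothetical, and the 4.7.0 cutoff is taken from the answer above rather than verified independently.

```python
def ner_pipeline_kwargs(transformers_version):
    """Return the keyword arguments that enable entity grouping.

    `transformers_version` is a version string such as "4.6.1".
    The 4.7.0 cutoff follows the answer above (an assumption here).
    """
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    if (major, minor) >= (4, 7):
        return {"aggregation_strategy": "simple"}
    return {"grouped_entities": True}


# Intended usage (needs transformers installed, so left as a comment):
#   import transformers
#   from transformers import pipeline
#   ner = pipeline("ner", **ner_pipeline_kwargs(transformers.__version__))
```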
answered Oct 08 '22 10:10 by SilentCloud