Named Entity Recognition with Huggingface transformers, mapping back to complete entities

Tags:

huggingface-transformers

I'm looking at the documentation for the Huggingface pipeline for Named Entity Recognition, and it's not clear to me how these results are meant to be used in an actual entity recognition model.

For instance, given the example in the documentation:

>>> from transformers import pipeline

>>> nlp = pipeline("ner")

>>> sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very "
...             "close to the Manhattan Bridge which is visible from the window.")

This outputs a list of all words that have been identified as an entity from the 9 classes defined above. Here are the expected results:

print(nlp(sequence))

[
{'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
{'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
{'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
{'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
{'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
{'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

While this alone is impressive, the correct way to get "DUMBO" from the following isn't clear to me:

{'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
{'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
{'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},

or even how to handle the cleaner multi-token matches, like distinguishing "New York City" from simply the city of "York".

While I can imagine heuristic methods, what's the correct intended way to join these tokens back into correct labels given your inputs?

asked Aug 02 '20 23:08

Mittenchops

2 Answers

The pipeline object can do that for you when you set the appropriate parameter:

transformers < 4.7.0: set grouped_entities to True.
transformers >= 4.7.0: set aggregation_strategy to "simple".

from transformers import pipeline

# transformers < 4.7.0:
# ner = pipeline("ner", grouped_entities=True)

# transformers >= 4.7.0:
ner = pipeline("ner", aggregation_strategy="simple")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."

output = ner(sequence)

print(output)

Output:

[{'entity_group': 'I-ORG', 'score': 0.9970663785934448, 'word': 'Hugging Face Inc'},
 {'entity_group': 'I-LOC', 'score': 0.9993778467178345, 'word': 'New York City'},
 {'entity_group': 'I-LOC', 'score': 0.9571147759755453, 'word': 'DUMBO'},
 {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'},
 {'entity_group': 'I-LOC', 'score': 0.9838141202926636, 'word': 'Manhattan Bridge'}]
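
For intuition about what grouping involves, here is a minimal sketch that only rejoins WordPiece fragments (the pieces prefixed with "##") from the ungrouped output shown in the question. It is not the pipeline's own implementation: the helper name merge_wordpieces and the choice to average the fragment scores are assumptions made for illustration. It does not merge separate words such as "New", "York" and "City", because the ungrouped output above carries no position information showing that they are adjacent.

```python
def merge_wordpieces(tokens):
    """Rejoin '##' continuation pieces onto the preceding token (hypothetical helper)."""
    words = []
    for tok in tokens:
        piece = tok["word"]
        if piece.startswith("##") and words:
            # Continuation piece: strip the '##' marker and glue it onto the previous word.
            words[-1]["word"] += piece[2:]
            words[-1]["scores"].append(tok["score"])
        else:
            # Start of a new word; keep the label of its first piece.
            words.append({"word": piece, "entity": tok["entity"], "scores": [tok["score"]]})
    # Assumption: report the mean of the fragment scores as the word score.
    return [
        {"word": w["word"], "entity": w["entity"], "score": sum(w["scores"]) / len(w["scores"])}
        for w in words
    ]


tokens = [
    {"word": "D", "score": 0.9825621843338013, "entity": "I-LOC"},
    {"word": "##UM", "score": 0.936983048915863, "entity": "I-LOC"},
    {"word": "##BO", "score": 0.8987102508544922, "entity": "I-LOC"},
]
merged = merge_wordpieces(tokens)
print(merged[0]["word"])  # DUMBO
```

In practice the built-in grouping parameter is the simpler route; a hand-rolled merge like this is mainly useful for understanding the raw token output.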

answered Oct 08 '22 09:10

cronoik

Quick update: grouped_entities has been deprecated.

UserWarning: grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="AggregationStrategy.SIMPLE" instead.
f'grouped_entities is deprecated and will be removed in version v5.0.0, defaulted to aggregation_strategy="{aggregation_strategy}" instead.'

You will have to change your code to:

ner = pipeline("ner", aggregation_strategy="simple")
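
If the same code has to run against both older and newer installs, one option is to pick the keyword argument from the installed version. The sketch below is only an illustration: the helper name ner_pipeline_kwargs is hypothetical, and the 4.7.0 cutoff is taken from the answer above rather than verified against the release notes.

```python
def ner_pipeline_kwargs(transformers_version):
    """Choose the entity-grouping keyword argument for a version string (hypothetical helper)."""
    # Keep only the numeric (major, minor) components, e.g. "4.7.0.dev0" -> [4, 7].
    parts = []
    for component in transformers_version.split(".")[:2]:
        digits = "".join(ch for ch in component if ch.isdigit())
        parts.append(int(digits or 0))
    while len(parts) < 2:
        parts.append(0)
    # Assumption from the answer above: aggregation_strategy is available from 4.7.0 on.
    if tuple(parts) >= (4, 7):
        return {"aggregation_strategy": "simple"}
    return {"grouped_entities": True}


print(ner_pipeline_kwargs("4.6.1"))
print(ner_pipeline_kwargs("4.7.0"))

# Intended usage, with transformers installed:
#   import transformers
#   from transformers import pipeline
#   ner = pipeline("ner", **ner_pipeline_kwargs(transformers.__version__))
```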

answered Oct 08 '22 10:10

SilentCloud
