extending NLP entity extraction

Question

We would like to identify neighborhoods and streets in various cities from a simple search query. We work not only with English but also with several languages written in Cyrillic script. We also need to be able to detect misspelled location names. While looking at Python libraries, I found this one: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html

We tried playing around with it but could not find a way to extend the entity recognition database. How can that be done?
If it can't be done, is there another multilingual NLP library that can help with spell checking and also extract entities matching a custom database?

Moritz · Accepted Answer

Have a look at the pretrained models on Hugging Face.

There is a multilingual NER model trained on 40 languages, including Cyrillic-script languages like Russian. It's a fine-tuned version of XLM-RoBERTa, so accuracy seems to be very good. See details here: https://huggingface.co/jplu/tf-xlm-r-ner-40-lang
There is also a multilingual DistilBERT model fine-tuned for typo detection on the GitHub Typo Corpus. The corpus seems to include typos from 15 different languages, including Russian. See details here: https://huggingface.co/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection

Here is some example code from the documentation, slightly altered for your use case:

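from transformers import pipeline

# Load the typo-detection model as a token-classification ("ner") pipeline.
typo_checker = pipeline("ner", model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
                        tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection")

result = typo_checker("я живу в Мосве")
# Drop the first and last entries (the [CLS] and [SEP] special tokens in this output).
result[1:-1]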

 #[{'word': 'я', 'score': 0.7886862754821777, 'entity': 'ok', 'index': 1},
 #{'word': 'жив', 'score': 0.6303715705871582, 'entity': 'ok', 'index': 2},
 #{'word': '##у', 'score': 0.7259598970413208, 'entity': 'ok', 'index': 3},
 #{'word': 'в', 'score': 0.7102937698364258, 'entity': 'ok', 'index': 4},
 #{'word': 'М', 'score': 0.5045614242553711, 'entity': 'ok', 'index': 5},
 #{'word': '##ос', 'score': 0.560469925403595, 'entity': 'typo', 'index': 6},
 #{'word': '##ве', 'score': 0.8228507041931152, 'entity': 'ok', 'index': 7}]

result = typo_checker("I live in Moskkow")
result[1:-1]

 #[{'word': 'I', 'score': 0.7598089575767517, 'entity': 'ok', 'index': 1},
 #{'word': 'live', 'score': 0.8173692226409912, 'entity': 'ok', 'index': 2},
 #{'word': 'in', 'score': 0.8289134502410889, 'entity': 'ok', 'index': 3},
 #{'word': 'Mo', 'score': 0.7344270944595337, 'entity': 'ok', 'index': 4},
 #{'word': '##sk', 'score': 0.6559176445007324, 'entity': 'ok', 'index': 5},
 #{'word': '##kow', 'score': 0.8762879967689514, 'entity': 'ok', 'index': 6}]

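Unfortunately, it doesn't always work: in the Russian example only one sub-token of "Мосве" is tagged as a typo, and in the English example "Moskkow" is not flagged at all. Maybe it's still sufficient for your use case.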
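The model tags WordPiece sub-tokens ("М", "##ос", "##ве") rather than whole words, so you will probably want to glue the pieces back together before acting on the result. Below is a minimal sketch of that step. The helper name merge_wordpieces and the rule "a word is a typo if any of its pieces is tagged 'typo'" are my own assumptions, not part of the transformers API, and the sample input is the Russian output from above with the scores left out.

```python
def merge_wordpieces(tokens):
    """Group WordPiece tokens into whole words.

    Hypothetical helper: a word is flagged as a typo if at least one
    of its pieces carries the 'typo' tag.
    """
    words = []
    for tok in tokens:
        piece = tok["word"]
        is_typo = tok["entity"] == "typo"
        if piece.startswith("##") and words:
            # Continuation piece: append it to the previous word.
            words[-1]["word"] += piece[2:]
            words[-1]["typo"] = words[-1]["typo"] or is_typo
        else:
            # A piece without the "##" prefix starts a new word.
            words.append({"word": piece, "typo": is_typo})
    return words


# Tokens copied from the Russian example output above (scores omitted).
sample = [
    {"word": "я", "entity": "ok"},
    {"word": "жив", "entity": "ok"},
    {"word": "##у", "entity": "ok"},
    {"word": "в", "entity": "ok"},
    {"word": "М", "entity": "ok"},
    {"word": "##ос", "entity": "typo"},
    {"word": "##ве", "entity": "ok"},
]

merged = merge_wordpieces(sample)
suspects = [w["word"] for w in merged if w["typo"]]
print(suspects)
```

The suspect words could then be compared against your own list of location names. This does nothing for cases like "Moskkow", where no piece is tagged at all.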

Another option would be spaCy. It doesn't offer models for as many languages, but with spaCy's EntityRuler it's easy to define new entities manually, i.e. to "extend the entity recognition database".
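A rough sketch of what that could look like, assuming spaCy v3 and the blank multi-language pipeline ("xx") so that no trained model is needed. The labels CITY and STREET, the patterns, and the function name build_location_nlp are made up for illustration; in practice you would generate the patterns from your own location database.

```python
import spacy


def build_location_nlp():
    """Build a rule-only pipeline (hypothetical helper name)."""
    # "xx" is spaCy's language-neutral pipeline: tokenizer only, no trained model.
    nlp = spacy.blank("xx")
    ruler = nlp.add_pipe("entity_ruler")
    # Illustrative patterns; LOWER makes the match case-insensitive.
    ruler.add_patterns([
        {"label": "CITY", "pattern": [{"LOWER": "moscow"}]},
        {"label": "CITY", "pattern": [{"LOWER": "москва"}]},
        {"label": "STREET", "pattern": [{"LOWER": "tverskaya"}, {"LOWER": "street"}]},
    ])
    return nlp


nlp = build_location_nlp()
doc = nlp("I live on Tverskaya Street in Moscow")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Note that these patterns match tokens exactly (apart from case), so misspelled or inflected forms such as "Мосве" will not match unless you add them as patterns or correct the spelling first.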

Tags:

python

machine-learning

nlp

polyglot

named-entity-extraction

Dory Zidon
