 

Extending NLP entity extraction

We would like to identify neighborhoods and streets in various cities from a simple search. We work not only with English but also with several languages written in Cyrillic, and we need to be able to recognize misspelled location names. When looking at Python libraries, I found this one: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html

We tried to play around with it, but cannot find a way to extend the entity recognition database. How can that be done?
If it can't, is there any other multilingual NLP library you could suggest that helps with spell checking and also extracts entities matching a custom database?

Dory Zidon asked Nov 08 '22

1 Answer

Have a look at HuggingFace's pretrained models.

  1. They have a multilingual NER model trained on 40 languages, including Cyrillic languages like Russian. It's a fine-tuned version of XLM-RoBERTa, so accuracy seems to be very good (a usage sketch follows this list). See details here: https://huggingface.co/jplu/tf-xlm-r-ner-40-lang
  2. They also have a multilingual DistilBERT model trained for typo detection based on the GitHub Typo Corpus. The corpus seems to include typos from 15 different languages, including Russian. See details here: https://huggingface.co/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection
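For the NER model, loading it through the same pipeline API should work. This is a sketch based on the standard transformers interface rather than the model card; since the repository ships TensorFlow weights (the "tf-" prefix), framework="tf" is passed explicitly:

from transformers import pipeline

# Sketch: load the 40-language NER model as a token-classification pipeline.
# The repo name starts with "tf-", i.e. TensorFlow weights, so we ask the
# pipeline to use the TensorFlow backend explicitly.
ner = pipeline("ner",
               model="jplu/tf-xlm-r-ner-40-lang",
               tokenizer="jplu/tf-xlm-r-ner-40-lang",
               framework="tf")

print(ner("я живу в Москве"))  # "Москве" should come back tagged as a location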

For the typo detector, here is some example code from the documentation, slightly altered for your use case:

from transformers import pipeline

# Load the typo-detection model as a token-classification ("ner") pipeline.
typo_checker = pipeline("ner", model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
                        tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection")

# Each sub-word token is labelled "ok" or "typo"; the slice drops the
# first and last entries (special-token markers) of the output.
result = typo_checker("я живу в Мосве")
result[1:-1]

 #[{'word': 'я', 'score': 0.7886862754821777, 'entity': 'ok', 'index': 1},
 #{'word': 'жив', 'score': 0.6303715705871582, 'entity': 'ok', 'index': 2},
 #{'word': '##у', 'score': 0.7259598970413208, 'entity': 'ok', 'index': 3},
 #{'word': 'в', 'score': 0.7102937698364258, 'entity': 'ok', 'index': 4},
 #{'word': 'М', 'score': 0.5045614242553711, 'entity': 'ok', 'index': 5},
 #{'word': '##ос', 'score': 0.560469925403595, 'entity': 'typo', 'index': 6},
 #{'word': '##ве', 'score': 0.8228507041931152, 'entity': 'ok', 'index': 7}]

result = typo_checker("I live in Moskkow")
result[1:-1]

 #[{'word': 'I', 'score': 0.7598089575767517, 'entity': 'ok', 'index': 1},
 #{'word': 'live', 'score': 0.8173692226409912, 'entity': 'ok', 'index': 2},
 #{'word': 'in', 'score': 0.8289134502410889, 'entity': 'ok', 'index': 3},
 #{'word': 'Mo', 'score': 0.7344270944595337, 'entity': 'ok', 'index': 4},
 #{'word': '##sk', 'score': 0.6559176445007324, 'entity': 'ok', 'index': 5},
 #{'word': '##kow', 'score': 0.8762879967689514, 'entity': 'ok', 'index': 6}]

As the second example shows, it doesn't always catch the typo, unfortunately, but maybe it's sufficient for your use case.
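If it is, a small helper along these lines (a sketch, assuming the output format shown above) pulls out just the flagged tokens:

# Keep only the sub-word tokens the model labelled as typos.
def flagged_tokens(result):
    return [t["word"] for t in result if t["entity"] == "typo"]

flagged_tokens(typo_checker("я живу в Мосве"))
# ['##ос']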

Another option would be spaCy. It doesn't have as many models for different languages, but with spaCy's EntityRuler it's easy to manually define new entities, i.e. to "extend the entity recognition database".
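For example, here is a minimal sketch using spaCy v3; the street and neighborhood names are placeholders standing in for entries from your custom database:

import spacy

# Start from a blank Russian pipeline and attach an EntityRuler.
nlp = spacy.blank("ru")
ruler = nlp.add_pipe("entity_ruler")

# These patterns would come from your own database of streets/neighborhoods.
ruler.add_patterns([
    {"label": "LOC", "pattern": "Арбат"},                 # exact phrase match
    {"label": "LOC", "pattern": [{"LOWER": "тверская"}]}, # case-insensitive token match
])

doc = nlp("Я живу на улице Арбат")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Арбат', 'LOC')]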

Moritz answered Nov 15 '22