We would like to identify neighborhoods and streets in various cities from a simple search query. We work not only with English but also with several languages written in Cyrillic script. We also need to detect misspelled location names. While looking at Python libraries, I found this one: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html
We tried playing around with it but could not find a way to extend the entity recognition database. How can that be done?
If it can't, is there any other multilingual NLP library that can help with spell checking and also extract entities matching a custom database?
Have a look at Hugging Face's pretrained models.
Here is some example code from the documentation, slightly altered for your use case:
from transformers import pipeline

# Token-classification ("ner") pipeline loaded with a multilingual typo-detection model
typo_checker = pipeline(
    "ner",
    model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
    tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
)
result = typo_checker("я живу в Мосве")
result[1:-1]
#[{'word': 'я', 'score': 0.7886862754821777, 'entity': 'ok', 'index': 1},
#{'word': 'жив', 'score': 0.6303715705871582, 'entity': 'ok', 'index': 2},
#{'word': '##у', 'score': 0.7259598970413208, 'entity': 'ok', 'index': 3},
#{'word': 'в', 'score': 0.7102937698364258, 'entity': 'ok', 'index': 4},
#{'word': 'М', 'score': 0.5045614242553711, 'entity': 'ok', 'index': 5},
#{'word': '##ос', 'score': 0.560469925403595, 'entity': 'typo', 'index': 6},
#{'word': '##ве', 'score': 0.8228507041931152, 'entity': 'ok', 'index': 7}]
result = typo_checker("I live in Moskkow")
result[1:-1]
#[{'word': 'I', 'score': 0.7598089575767517, 'entity': 'ok', 'index': 1},
#{'word': 'live', 'score': 0.8173692226409912, 'entity': 'ok', 'index': 2},
#{'word': 'in', 'score': 0.8289134502410889, 'entity': 'ok', 'index': 3},
#{'word': 'Mo', 'score': 0.7344270944595337, 'entity': 'ok', 'index': 4},
#{'word': '##sk', 'score': 0.6559176445007324, 'entity': 'ok', 'index': 5},
#{'word': '##kow', 'score': 0.8762879967689514, 'entity': 'ok', 'index': 6}]
Unfortunately, it doesn't always work: in the second example every token of the misspelled "Moskkow" is tagged 'ok'. It may still be sufficient for your use case.
Another option would be spaCy. It doesn't have as many models for different languages, but with spaCy's EntityRuler it's easy to define new entities manually, i.e. to "extend the entity recognition database".
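As a rough sketch of what that could look like, assuming spaCy v3 (where add_pipe takes the component name as a string) and a blank multilingual "xx" pipeline so that no trained model has to be downloaded. The labels CITY and STREET, the patterns, and the sample query are made up for illustration; in practice you would generate the patterns from your own location database.

```python
import spacy

# Blank multilingual pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("xx")

# Add an EntityRuler and register custom patterns (illustrative values only).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase patterns (case-sensitive string match).
    {"label": "CITY", "pattern": "Москва"},
    {"label": "CITY", "pattern": "Moscow"},
    # Token pattern: matches "тверская улица" regardless of case.
    {"label": "STREET", "pattern": [{"LOWER": "тверская"}, {"LOWER": "улица"}]},
])

doc = nlp("Москва Тверская улица 5")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Note that the EntityRuler only matches the patterns you give it, so a misspelling such as "Мосва" would not be recognized; you would still need a separate typo-detection or fuzzy-matching step before or after it.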