
Multilingual spell checking with language detection

I'm working on spell checking of mixed language webpages, and haven't been able to find any existing research on the subject.

The aim is to detect the language of each sentence in a mixed-language webpage automatically, then spell check each sentence against the dictionary for its language. Assume that we can ignore sentences that mix several languages (e.g. "He has a certain je ne sais quoi"), and that a webpage contains no more than 2 or 3 languages.
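As a rough illustration of that goal, the sketch below splits a page into sentences, guesses each sentence's language, and checks its words against that language's word list. The tiny word lists, the regex sentence splitter, and the coverage-based `detect_language` are all assumptions made for this example, not the code referred to in the question; a real checker would use full dictionaries and a proper detector.

```python
import re

# Toy word lists invented for illustration; real dictionaries are far larger.
WORDLISTS = {
    "en": {"the", "is", "a", "welcome", "to", "government", "of", "wales", "website"},
    "cy": {"croeso", "i", "wefan", "llywodraeth", "cymru", "mae", "yn", "y"},
}

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    """Lowercased runs of letters (digits and underscores excluded)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())

def detect_language(sentence):
    """Placeholder detector: the language whose word list covers the most tokens."""
    tokens = words(sentence)
    return max(WORDLISTS, key=lambda lang: sum(t in WORDLISTS[lang] for t in tokens))

def check_page(text):
    """Return one (sentence, language, unknown_words) tuple per sentence."""
    results = []
    for sentence in split_sentences(text):
        lang = detect_language(sentence)
        unknown = [t for t in words(sentence) if t not in WORDLISTS[lang]]
        results.append((sentence, lang, unknown))
    return results

page = "Welcome to the website of the goverment. Croeso i wefan Llywodraeth Cymru."
for sentence, lang, unknown in check_page(page):
    print(lang, unknown, sentence)
```

Detecting per sentence before checking keeps a correctly spelled Welsh sentence from being flagged word by word against the English list.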

Trivial example (Welsh + English): http://wales.gov.uk/

I'm currently using a mix of:

  • Character distribution (e.g. U+0600 to U+06FF = Arabic)
  • n-grams to discern languages with similar characters
  • Dictionary lookup to discern locale, e.g. en-US vs. en-GB
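A minimal sketch of the first and third heuristics, for readers who want something concrete. The Arabic range is the one quoted in the list; the other Unicode block ranges, the locale marker words, and the function names `dominant_script` and `guess_locale` are assumptions chosen for illustration, not the code described in this question.

```python
# Unicode block ranges. Only the Arabic range comes from the list above;
# the rest are standard block boundaries added for illustration.
SCRIPT_RANGES = {
    "Latin": (0x0041, 0x024F),     # approximate: Basic Latin letters through Latin Extended-B
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Hebrew": (0x0590, 0x05FF),
    "Arabic": (0x0600, 0x06FF),
}

def dominant_script(text):
    """Return the script that most letters fall into, or None if there are no letters."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue
        cp = ord(ch)
        name = "Other"
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                name = script
                break
        counts[name] = counts.get(name, 0) + 1
    return max(counts, key=counts.get) if counts else None

# Hypothetical marker words; a real system would diff two full locale dictionaries.
LOCALE_MARKERS = {
    "en-US": {"color", "center", "organize", "analyze"},
    "en-GB": {"colour", "centre", "organise", "analyse"},
}

def guess_locale(tokens):
    """Return the locale with the most marker hits, or None if nothing matched.
    Ties go to whichever locale is listed first."""
    scores = {loc: sum(t in vocab for t in tokens) for loc, vocab in LOCALE_MARKERS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(dominant_script("مرحبا بالعالم"))
print(guess_locale("the colour of the centre".split()))
```

Script ranges only separate languages written in different scripts; Welsh and English both land in "Latin", which is where the n-gram step comes in.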

I have working code but am concerned it may be naive or needlessly reinventing the wheel. Has anyone else done this before?

Oliver Emberton asked May 03 '11 17:05




1 Answer

You can use a web API (Google or Yandex) for spell checking and language detection, but I don't think this option is very scalable.

Another option is to use the free Lucene tools for spell checking (http://wiki.apache.org/lucene-java/SpellChecker), but you have to index a corpus first; Wikipedia is a good choice. Language detection can be achieved with TextCat (http://textcat.sourceforge.net/).

yura answered Nov 15 '22 09:11