I'm working on spell checking mixed-language webpages and haven't been able to find any existing research on the subject.
The aim is to automatically detect the language of each sentence within a mixed-language webpage and spell check each sentence against the dictionary for its language. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume a webpage won't contain more than 2 or 3 languages.
Trivial example (Welsh + English): http://wales.gov.uk/
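To make the aim concrete, here is a minimal sketch of the pipeline described above: split the page text into sentences, guess each sentence's language, then check its words against that language's word list. Everything in it is an assumption for illustration: the stopword lists and dictionaries are tiny hypothetical samples, the language guess is a simple stopword count, and the function names (split_sentences, detect_language, check_page) are made up here. It is not the code referred to in this question.

```python
import re

# Toy data, hypothetical and far too small for real use.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "to", "in", "a", "for"},
    "cy": {"y", "yr", "mae", "yn", "ac", "i", "o", "ar"},
}
DICTIONARIES = {
    "en": {"the", "government", "is", "working", "for", "wales"},
    "cy": {"mae", "llywodraeth", "cymru", "yn", "gweithio", "dros", "gymru"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters only."""
    return re.findall(r"[^\W\d_]+", sentence.lower())


def detect_language(tokens):
    """Pick the language whose stopwords appear most often, or None."""
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def check_page(text):
    """Return (sentence, language, unknown_words) for every sentence."""
    results = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        lang = detect_language(tokens)
        if lang is None:
            unknown = []  # language not recognised: skip spell checking
        else:
            unknown = [t for t in tokens if t not in DICTIONARIES[lang]]
        results.append((sentence, lang, unknown))
    return results


if __name__ == "__main__":
    page = ("The goverment is working for Wales. "
            "Mae Llywodraeth Cymru yn gweithoi dros Gymru.")
    for sentence, lang, unknown in check_page(page):
        print(lang, unknown, sentence)
```

A real version would swap the toy dictionaries for proper word lists and the stopword count for a stronger language identifier, but the per-sentence structure would stay the same.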
I'm currently using a mix of:
I have working code but am concerned it may be naive or needlessly re-inventing a wheel. Has anyone else done this before?
Open Word and click File > Options > Advanced. Under Editing options, select the "Automatically switch keyboard to match language of surrounding text" check box. Note: this check box is only visible after you enable a keyboard layout for a language.
You can use an API (Google or Yandex) for spell checking and language detection, but I don't think this option scales well.
Another option is to use the free Lucene spell checking tools (http://wiki.apache.org/lucene-java/SpellChecker), but you have to index a corpus first; Wikipedia is a good choice. Language detection can be achieved with http://textcat.sourceforge.net/
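TextCat is based on character n-gram profiles (the Cavnar and Trenkle approach): each language is represented by its most frequent n-grams in rank order, and a text is assigned to the language whose ranking it disagrees with least. The following is a rough sketch of that idea only, not TextCat's actual code. The training samples are tiny hypothetical strings, and the names ngram_profile, out_of_place and identify, along with the defaults max_n=3 and top_k=300, are assumptions chosen for illustration.

```python
from collections import Counter

# Toy training text per language; a real profile needs a large corpus.
SAMPLES = {
    "en": "the government is working for the people of wales",
    "cy": "mae llywodraeth cymru yn gweithio dros gymru",
}


def ngram_profile(text, max_n=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences; n-grams missing from the language cost `penalty`."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


if __name__ == "__main__":
    profiles = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
    print(identify("the people are working", profiles))
```

With profiles built from a real corpus such as Wikipedia, the same identify call could be run once per sentence to choose which spell checking index to query.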