
Using Markov models to convert all-caps to mixed-case and related problems

I've been thinking about using Markov techniques to restore missing information to natural language text.

  • Restore all-caps text to mixed-case.
  • Restore accents / diacritics to languages which should have them but have been converted to plain ASCII.
  • Convert rough phonetic transcriptions back into native alphabets.

These seem to be ordered from least to most difficult. In each case, the problem is resolving ambiguities based on context.

I could use Wiktionary as a dictionary and Wikipedia as a corpus, with n-grams and hidden Markov models to resolve the ambiguities.

Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?

Examples

  • GEORGE LOST HIS SIM CARD IN THE BUSH   ⇨   George lost his SIM card in the bush
  • tantot il rit a gorge deployee   ⇨   tantôt il rit à gorge déployée
hippietrail asked Dec 21 '10 02:12


1 Answer

I think you can use Markov models (HMMs) for all three tasks, but also take a look at more modern models such as conditional random fields (CRFs). Also, here are some search terms to boost your google-fu:

  • Restore mixed case to text in all caps

This is called truecasing.
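To make the idea concrete, here is a minimal sketch of truecasing as a Markov decoding problem, assuming a tiny hand-made corpus in place of Wikipedia. The hidden states are cased word forms and the observation is the case-folded word, so only transition probabilities matter; Viterbi search picks the most likely casing sequence. The function names (`train`, `truecase`), the interpolation weight `lam`, and the toy corpus are all illustrative assumptions, not part of any library.

```python
import math
from collections import Counter, defaultdict

def train(sentences):
    """Collect cased forms per lowercase word plus unigram and bigram counts."""
    forms = defaultdict(Counter)  # lowercase word -> cased forms seen for it
    bigrams = Counter()           # (previous cased word, cased word) -> count
    context = Counter()           # how often a word occurs as a left context
    words = Counter()             # cased word -> count
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        for prev, cur in zip(tokens, tokens[1:]):
            forms[cur.lower()][cur] += 1
            bigrams[(prev, cur)] += 1
            context[prev] += 1
            words[cur] += 1
    return forms, bigrams, context, words

def truecase(text, model, lam=0.7):
    """Viterbi search over the cased candidates of each word."""
    forms, bigrams, context, words = model
    total, vocab = sum(words.values()), len(words)

    def logp(prev, cur):
        # Bigram estimate interpolated with an add-one smoothed unigram
        # estimate, so unseen bigrams still get a non-zero probability.
        p_bi = bigrams[(prev, cur)] / context[prev] if context[prev] else 0.0
        p_uni = (words[cur] + 1) / (total + vocab)
        return math.log(lam * p_bi + (1 - lam) * p_uni)

    best = {"<s>": (0.0, [])}  # state -> (log score, best path ending there)
    for word in text.split():
        key = word.lower()
        # Unknown words fall back to their lowercase form.
        candidates = list(forms[key]) if key in forms else [key]
        step = {}
        for cand in candidates:
            score, path = max(
                ((s + logp(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cand] = (score, path + [cand])
        best = step
    return " ".join(max(best.values(), key=lambda item: item[0])[1])

# Toy corpus standing in for a real one such as Wikipedia text.
corpus = [
    "George lost his SIM card in the bush",
    "He walked into the bush",
    "George Bush lost the election",
    "The card was lost",
]
model = train(corpus)
print(truecase("GEORGE LOST HIS SIM CARD IN THE BUSH", model))
```

With this toy corpus the ambiguous words ("the"/"The", "bush"/"Bush") are resolved by their left context. A real system would need a much larger corpus and better handling of unknown words and punctuation.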

  • Restore accents / diacritics to languages which should have them but have been converted to plain ASCII

I suspect Markov models are going to have a hard time on this. OTOH, labelled training data is free since you can just take a bunch of accented text in the target language and strip the accents. See also next answer.
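The point about free training data can be sketched as follows, using Python's standard `unicodedata` module to strip combining marks. The helper names `strip_accents` and `make_training_pairs` are made up for illustration. Note that this only removes diacritics that decompose under NFD; letters such as "ø" have no decomposition and would need an explicit mapping.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def make_training_pairs(sentences):
    """Pair each accent-stripped input with its original as the target."""
    return [(strip_accents(s), s) for s in sentences]

pairs = make_training_pairs(["tantôt il rit à gorge déployée"])
for plain, accented in pairs:
    print(plain, "=>", accented)
```

Each pair gives a model the degraded input and the correct output, so any accented corpus in the target language doubles as labelled data.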

  • Convert rough phonetic transcriptions back into native alphabets

This seems strongly related to machine transliteration, which has been tried using pair HMMs (from bioinformatics/genome work).

Fred Foo answered Nov 11 '22 19:11