I've been thinking about using Markov techniques to restore missing information to natural language text.
The examples below are listed roughly from least to most difficult. Basically, the problem is resolving ambiguities based on context.
My plan is to use Wiktionary as a dictionary and Wikipedia as a corpus, applying n-grams and hidden Markov models to resolve the ambiguities; a rough sketch follows the examples below.
Am I on the right track? Are there already some services, libraries, or tools for this sort of thing?
Examples

1. Restoring capitalization to text that has been lowercased.
2. Restoring accents/diacritics to text where they have been stripped.
3. Restoring text that has been transliterated into another alphabet back into its original script.
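For concreteness, here's a rough sketch (Python) of the kind of disambiguation I have in mind; `candidates` and `bigram_counts` are stand-in names for data that would actually be extracted from Wiktionary and Wikipedia:

```python
from collections import defaultdict
from itertools import product

# Stand-in data: each ambiguous surface form maps to its possible
# restorations (here, Spanish words with their accents stripped).
candidates = {"si": {"si", "sí"}, "el": {"el", "él"}}

# Bigram counts that would be harvested from a Wikipedia dump.
bigram_counts = defaultdict(int)

def score(tokens):
    """Unnormalized add-one bigram score for one candidate restoration."""
    total = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        total *= bigram_counts[(prev, cur)] + 1
    return total

def restore(tokens):
    """Try every combination of restorations, keep the best-scoring one."""
    options = [sorted(candidates.get(t, {t})) for t in tokens]
    return list(max(product(*options), key=score))
```

An HMM with Viterbi decoding would replace the brute-force `product` with dynamic programming; that's the part I'd hope a library already handles.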
I think you can use hidden Markov models (HMMs) for all three tasks, but also take a look at more modern models such as conditional random fields (CRFs). Also, here's some boost for your google-fu:
For your first example: restoring capitalization is called truecasing.
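As a starting point, even a unigram baseline (no Markov chain yet) goes a long way: record each word's most frequent cased form in a cased corpus and look it up at restoration time. A minimal sketch, assuming pre-tokenized sentences:

```python
from collections import Counter, defaultdict

def train_truecaser(cased_sentences):
    """Count how often each lowercased token appears in each cased form."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence:
            forms[token.lower()][token] += 1
    return forms

def truecase(tokens, forms):
    """Unigram baseline: pick each token's most frequent cased form."""
    return [forms[t].most_common(1)[0][0] if t in forms else t
            for t in tokens]

forms = train_truecaser([["Paris", "is", "in", "France"]])
print(truecase(["paris", "is", "in", "france"], forms))
# -> ['Paris', 'is', 'in', 'France']
```

A proper HMM would add transition probabilities between cased forms, so that context rather than raw frequency decides ambiguous words like "may"/"May" or "bush"/"Bush".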
For accent restoration: I suspect Markov models are going to have a hard time with this. On the other hand, labelled training data is free, since you can just take a bunch of accented text in the target language and strip the accents. See also the next point.
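Generating that labelled data is straightforward with Unicode normalization: decompose each character, drop the combining marks, and recompose. A minimal sketch:

```python
import unicodedata

def strip_accents(text):
    """Decompose to NFD, drop combining marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)

# Each (stripped, original) pair is a free training example.
original = "Él habló al capitán"
print(strip_accents(original))  # -> "El hablo al capitan"
```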
For the third example: this seems strongly related to machine transliteration, which has been tried using pair HMMs (a technique borrowed from bioinformatics/genome-alignment work).