I'm interested in writing a PHP script (I do welcome language-agnostic suggestions) that would transliterate a sentence or word written in English (phoenetically) into the script of another language. Since I'm looking at English written phoenetically (i.e. by ear): I'd have to deal with variant spellings of the same word.
It is assumed that no standard exists for romanization (for instance, in Chinese, you have the Simplified Wade, etc.)
Does anyone have any advice on where I could start?
EDIT: I'm doing this purely for educational purposes, and I was initially under the impression that in order to figure out the connection between variant spellings (which could be found in a corpus of IM messages, Facebook posts written in the romanized form of the language), you'd need some sort of machine learning tool. However, I'd like to know if I was on the right track, and I'd like some help in figuring out what next I should look into to get this working (for instance: which machine learning tool should I look into?).
Try Transliteration PHP Extension by Derick Rethans:
This extension allows you to transliterate text in non-latin characters (such as Chinese, Cyrillic, Greek etc) to latin characters. Besides the transliteration the extension also contains filters to upper- and lowercase latin, cyrillic and greek, and perform special forms of transliteration such as converting ligatures such as the Norwegian "æ" to "ae" and normalizing punctuation and spacing.
It seems he has already started on just what you are looking for! (unless you want to deal with english-> latin language, but at least this deals with scripts of other languages. :) )
I know with Japanese at least, you have a set number of letter combinations.
So, you could do something like create a matching array like this
array(
'oo' => 'おう',
'oh' => 'おう',
'ou' => 'おう'
)
Of course, continuing on, and making sure you don't match 'su', when it should be 'tsu'.
This would only be a starting point, of course.
Machine learning is probably most practical with Chinese...but here's a rough start to hiragana: https://gist.github.com/1154969
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With