
Intelligent transliteration in PHP

Tags: php, nlp

I'm interested in writing a PHP script (language-agnostic suggestions are welcome) that transliterates a word or sentence written phonetically in English letters into the script of another language. Because the input is spelled phonetically (i.e. by ear), I'd have to deal with variant spellings of the same word.

Assume that no standard romanization exists for the target language (in contrast to Chinese, for instance, which has schemes such as Simplified Wade).

Does anyone have any advice on where I could start?

EDIT: I'm doing this purely for educational purposes. My initial impression was that working out the connection between variant spellings (which could be found in a corpus of IM messages or Facebook posts written in the romanized form of the language) would require some sort of machine learning tool. I'd like to know whether I'm on the right track, and what I should look into next to get this working (for instance, which machine learning tool?).

arkate asked Aug 16 '11 14:08



2 Answers

Try the Transliteration PHP extension by Derick Rethans:

This extension allows you to transliterate text in non-latin characters (such as Chinese, Cyrillic, Greek etc) to latin characters. Besides the transliteration the extension also contains filters to upper- and lowercase latin, cyrillic and greek, and perform special forms of transliteration such as converting ligatures such as the Norwegian "æ" to "ae" and normalizing punctuation and spacing.

It seems he has already started on just what you are looking for (unless you want to go from English to another Latin-script language, but at least this deals with the scripts of other languages). :)

djhaskin987 answered Oct 18 '22 16:10



With Japanese, at least, there is a fixed set of letter combinations.

So you could create a lookup array that maps each variant spelling to the same kana, like this:

// Variant romanizations of the same sound map to the same kana
$map = array(
  'oo' => 'おう',
  'oh' => 'おう',
  'ou' => 'おう',
);

You would continue from there with the rest of the combinations, making sure you don't match 'su' when the input is actually 'tsu'.

This would only be a starting point, of course.
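As a minimal sketch of that idea, assuming a deliberately tiny, made-up lookup table: PHP's built-in strtr(), when given an array, tries the longest keys first and does not re-scan text it has already replaced, so 'tsu' wins over 'su' without extra bookkeeping. The table contents and the transliterate_romaji() helper name are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical, incomplete romaji-to-hiragana table. A real table needs
// every syllable plus whatever variant spellings you want to accept.
$romajiToHiragana = array(
    'tsu' => 'つ',
    'su'  => 'す',
    'shi' => 'し',
    'si'  => 'し',   // variant spelling of 'shi'
    'na'  => 'な',
    'mi'  => 'み',
    'oo'  => 'おう',
    'oh'  => 'おう',
    'ou'  => 'おう',
);

// Hypothetical helper: lowercase the input, then let strtr() do the
// replacement. With an array argument, strtr() tries the longest keys
// first and never re-scans text it has already replaced.
function transliterate_romaji($text, array $map)
{
    return strtr(strtolower($text), $map);
}

echo transliterate_romaji('tsunami', $romajiToHiragana), "\n"; // prints つなみ
```

Anything that is not in the table passes through unchanged, which makes gaps in the table easy to spot while you build it up.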

Machine learning is probably most practical with Chinese, but here's a rough start on hiragana: https://gist.github.com/1154969

timw4mail answered Oct 18 '22 17:10
