
Romanization of Unicode text

I am looking for a way to transliterate Unicode letter characters from any language into accented Latin letters. The intent is to allow foreigners to gain insight into the pronunciation of names and words written in any non-Latin script.

Examples:

Greek: Romanize("Αλφαβητικός") returns "Alphabētikós" (or "Alfavi̱tikós")

Japanese: Romanize("しんばし") returns "shimbashi" (or "sinbasi")

Russian: Romanize("яйца Фаберже") returns "yaytsa Faberzhe" (or "jajca Faberže")

It should ideally support characters in the following scripts: CJK, Indic, Cyrillic, Semitic, and Greek. It should be data-driven and extensible, using data from the Unicode Consortium, the USA, the EU or the UN. The code should be open source and written in .NET or Java.

Does such a library exist?

Anthony Faull asked Mar 23 '12 16:03



3 Answers

The problem is a lot more complex than you think.

Greek, Cyrillic, Indic scripts, Georgian -> trivial; you could program that in an hour.
Thai, Japanese Kana -> doable with a bit more effort.
Japanese Kanji, Chinese -> these are not alphabets or syllabaries, so you are not actually transliterating. You are looking up the pronunciation of each symbol in a (hopefully large) dictionary; EDICT and CCDICT should work. You will often get it wrong unless you also consider the context, especially in Japanese.
Korean -> technically an alphabet, but computers can only handle the composed characters, so you need another large database. I'm not aware of any.
Arabic, Hebrew -> these languages don't write down short vowels, so your transliteration will often be something unreadable like "bytlhm" (Bethlehem). I'm not aware of any large databases that map Arabic or Hebrew words to their pronunciation.
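To illustrate the "trivial" alphabetic case from the list above, here is a minimal table-driven sketch in Java for Russian Cyrillic. The class name `RomanizeSketch`, the method `romanize`, and the mapping table are all hypothetical; the table follows one informal English-style convention and is not taken from any official standard. A data-driven version would presumably load the table from an external file rather than hard-code it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-letter romanization of Russian Cyrillic.
// The mapping is an informal convention chosen for illustration only.
final class RomanizeSketch {
    private static final Map<Character, String> TABLE = new HashMap<>();

    static {
        String[][] pairs = {
            {"а", "a"}, {"б", "b"}, {"в", "v"}, {"г", "g"}, {"д", "d"},
            {"е", "e"}, {"ё", "yo"}, {"ж", "zh"}, {"з", "z"}, {"и", "i"},
            {"й", "y"}, {"к", "k"}, {"л", "l"}, {"м", "m"}, {"н", "n"},
            {"о", "o"}, {"п", "p"}, {"р", "r"}, {"с", "s"}, {"т", "t"},
            {"у", "u"}, {"ф", "f"}, {"х", "kh"}, {"ц", "ts"}, {"ч", "ch"},
            {"ш", "sh"}, {"щ", "shch"}, {"ъ", ""}, {"ы", "y"}, {"ь", ""},
            {"э", "e"}, {"ю", "yu"}, {"я", "ya"}
        };
        for (String[] p : pairs) {
            TABLE.put(p[0].charAt(0), p[1]);
        }
    }

    // Replaces each mapped letter with its Latin equivalent and copies
    // everything else (spaces, digits, other scripts) through unchanged.
    static String romanize(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String latin = TABLE.get(Character.toLowerCase(c));
            if (latin == null) {
                out.append(c);
            } else if (Character.isUpperCase(c) && !latin.isEmpty()) {
                // Capitalize only the first Latin letter: 'Ж' becomes "Zh".
                out.append(Character.toUpperCase(latin.charAt(0)));
                out.append(latin.substring(1));
            } else {
                out.append(latin);
            }
        }
        return out.toString();
    }
}
```

With this table, `RomanizeSketch.romanize("яйца Фаберже")` gives "yaytsa Faberzhe", matching the first form in the question. The same approach does not extend to the dictionary-based cases (Kanji, Chinese) or to scripts that omit vowels.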

Sprachprofi answered Sep 25 '22 05:09


You can use Unidecode Sharp:

[a C#] port of Python Unidecode, which is itself a port of Perl unidecode. (There are also PHP and Ruby implementations available.)

Usage:

    using BinaryAnalysis.UnidecodeSharp;

    // ...

    // Unidecode() is an extension method on string; each call shows the romanized text.
    string _Greek = "Αλφαβητικός";
    MessageBox.Show(_Greek.Unidecode());

    string _Japan = "しんばし";
    MessageBox.Show(_Japan.Unidecode());

    string _Russian = "яйца Фаберже";
    MessageBox.Show(_Russian.Unidecode());

I hope this works for you.

Kerberos answered Sep 23 '22 05:09


I am unaware of any open source solution here beyond ICU. If ICU works for you, great. If not, note that I am the CTO of a company that sells a commercial product for this purpose, one that can deal with the icky cases like Chinese words, multiple readings in Japanese, and incomplete Arabic orthography.

bmargulies answered Sep 25 '22 05:09