
Convert Chinese characters to Hanyu Pinyin

How can I convert Chinese characters to Hanyu Pinyin?

E.g.

你 --> Nǐ

马 --> Mǎ


More Info:

Either the accented or the numerical form of Hanyu Pinyin is acceptable; the numerical form is my preference.

A Java library is preferred; however, a library in another language that can be put in a wrapper is also OK.

I would like anyone who has personally used such a library to recommend or comment on it, in terms of its quality and reliability.

bguiz asked Dec 05 '10 23:12


1 Answer

Converting hanzi to pinyin is a fairly difficult problem. Many hanzi have multiple pinyin readings, depending on context. Compare 长大 (pinyin: zhǎng dà) with 长城 (pinyin: cháng chéng). For this reason, single-character conversion is often useless unless the system outputs multiple candidates. Word segmentation is a further issue, since where the word boundaries fall can also change the pinyin. You may already know this, but I thought it was important to say.
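To make the 长大 / 长城 point concrete, here is a minimal sketch of word-level lookup with greedy longest-match segmentation. The class name `ContextPinyin`, the tiny hand-made dictionary, and the default reading chosen for an isolated 长 are all assumptions for illustration; this is not how any of the libraries mentioned here work internally, and a real system would need a full lexicon and a better segmenter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy converter: the longest dictionary match wins, single characters are the fallback. */
final class ContextPinyin {
    // Hypothetical hand-made entries in numbered pinyin (illustration only).
    private static final Map<String, String> WORDS = new HashMap<>();
    private static final int MAX_WORD_LENGTH = 2;

    static {
        WORDS.put("长大", "zhang3 da4");
        WORDS.put("长城", "chang2 cheng2");
        WORDS.put("长", "chang2"); // arbitrary default for the isolated character
        WORDS.put("大", "da4");
        WORDS.put("城", "cheng2");
    }

    /** Converts text assumed to contain only BMP characters (one char per hanzi). */
    static String convert(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 0;
            // Try the longest candidate first so that word readings beat character readings.
            for (int len = Math.min(MAX_WORD_LENGTH, text.length() - i); len >= 1; len--) {
                String pinyin = WORDS.get(text.substring(i, i + len));
                if (pinyin != null) {
                    out.add(pinyin);
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                // Unknown character: pass it through unchanged.
                out.add(text.substring(i, i + 1));
                matched = 1;
            }
            i += matched;
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(convert("长大")); // zhang3 da4
        System.out.println(convert("长城")); // chang2 cheng2
        System.out.println(convert("长"));   // chang2 (the default; context is missing)
    }
}
```

The last call shows why single-character conversion is weak: without a neighbouring character, the code can only fall back to whichever reading was picked as the default.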

That said, the Adso Package contains both a segmenter and a probabilistic pinyin annotator, based on the excellent Adso library. It takes a while to get used to, though, and may be much larger than what you are looking for (I have found in the past that it was a bit too bulky for my needs). Additionally, there doesn't appear to be a public API anywhere, and its C++ ...

For a recent project, because I was working with place names, I simply used the Google Translate API (specifically, the unofficial Java port), which, for common nouns at least, usually does a good job of translating to pinyin. The problem is commonly used alternative transliterations, such as "HongKong" for what should be "XiangGang". Given all of this, Google Translate is pretty limited, but it offers a start.

I hadn't heard of pinyin4j before, but after playing with it just now, I have found it less than optimal: while it outputs a list of candidate pinyin romanizations, it makes no attempt to statistically determine their likelihood. There is a method that returns a single representation, but it will soon be phased out, as it currently only returns the first romanization, not the most likely one. Where the library seems to do well is in conversion between romanization systems and in general configurability.
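On output format: since the question accepts either numbered or accented pinyin, it may help that going from the numbered form to the accented form is mechanical, whichever tool produces the syllables. The sketch below is a hypothetical helper (`PinyinTones` is not part of any library mentioned here). It assumes lowercase input, one syllable at a time, with `v` standing in for `ü`, and uses the usual placement rule: mark `a` or `e` if present, else the `o` in `ou`, else the last vowel.

```java
/** Converts one numbered pinyin syllable (e.g. "ni3") to its accented form (e.g. "nǐ"). */
final class PinyinTones {
    // Row order matches PLAIN; each row holds tones 1 to 4 for that vowel.
    private static final String PLAIN = "aeiouü";
    private static final String[] MARKED = {"āáǎà", "ēéěè", "īíǐì", "ōóǒò", "ūúǔù", "ǖǘǚǜ"};

    static String toAccented(String syllable) {
        if (syllable.isEmpty()) {
            return syllable;
        }
        char last = syllable.charAt(syllable.length() - 1);
        if (last < '0' || last > '5') {
            return syllable; // no trailing tone digit: leave unchanged
        }
        // Assumption: 'v' is used as the ASCII stand-in for 'ü'.
        String body = syllable.substring(0, syllable.length() - 1).replace('v', 'ü');
        int tone = last - '0';
        if (tone == 0 || tone == 5) {
            return body; // neutral tone carries no mark
        }
        int pos = markPosition(body);
        if (pos < 0) {
            return body; // no vowel found
        }
        char marked = MARKED[PLAIN.indexOf(body.charAt(pos))].charAt(tone - 1);
        return body.substring(0, pos) + marked + body.substring(pos + 1);
    }

    /** Index of the vowel that takes the tone mark, or -1 if there is none. */
    private static int markPosition(String body) {
        int a = body.indexOf('a');
        if (a >= 0) return a;
        int e = body.indexOf('e');
        if (e >= 0) return e;
        int ou = body.indexOf("ou");
        if (ou >= 0) return ou;
        for (int i = body.length() - 1; i >= 0; i--) {
            if (PLAIN.indexOf(body.charAt(i)) >= 0) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(toAccented("ni3"));    // nǐ
        System.out.println(toAccented("ma3"));    // mǎ
        System.out.println(toAccented("cheng2")); // chéng
    }
}
```

Capitalised input such as "Ni3" is not handled here; lowercasing before the call and re-capitalising afterwards would be one way to extend it.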

In short, then, the answer may be any one of these, depending on what you need. Idiosyncratic proper nouns? Google Translate. In need of statistics? Adso. Willing to accept candidate lists without context information? Pinyin4j.

NatHillard answered Sep 30 '22 11:09