
Convert numbered pinyin to pinyin with tone marks





Are there any scripts, libraries, or programs in Python, or tools usable from Bash (e.g. awk, perl, sed), that can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diàn nǎo)?

I have found the following examples, but they require PHP or C#:

  • [PHP] Convert numbered to accentuated Pinyin?
  • [C#] Any libraries to convert number Pinyin to Pinyin with tone markings?

I have also found various online tools, but they cannot handle a large number of conversions.

Village asked Nov 20 '11 08:11


3 Answers

I've got some Python 3 code that does this, and it's small enough to just put directly in the answer here.

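import re

# Row 0 lists the plain vowels; rows 1-4 hold the same vowels with that tone's mark.
# "v" and "ü" both map to the marked ü.
PinyinToneMark = {
    0: "aoeiuv\u00fc",
    1: "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6",
    2: "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8",
    3: "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da",
    4: "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc",
}

def decode_pinyin(s):
    s = s.lower()
    r = ""  # converted output so far
    t = ""  # syllable currently being collected
    for c in s:
        # Accept a literal ü as a letter so that input such as "lü4" is handled.
        if ('a' <= c <= 'z') or c == "\u00fc":
            t += c
        elif c == ':':
            # "u:" is an ASCII spelling of ü
            assert t[-1] == 'u'
            t = t[:-1] + "\u00fc"
        else:
            # Any other character ends the syllable and is itself dropped.
            if c >= '0' and c <= '5':
                tone = int(c) % 5  # 0 and 5 both mean neutral tone
                if tone != 0:
                    m = re.search("[aoeiuv\u00fc]+", t)
                    if m is None:
                        t += c
                    elif len(m.group(0)) == 1:
                        t = t[:m.start(0)] + PinyinToneMark[tone][PinyinToneMark[0].index(m.group(0))] + t[m.end(0):]
                    else:
                        # Several vowels: a, then o, then e take the mark;
                        # in "ui" and "iu" the last vowel takes it.
                        if 'a' in t:
                            t = t.replace("a", PinyinToneMark[tone][0])
                        elif 'o' in t:
                            t = t.replace("o", PinyinToneMark[tone][1])
                        elif 'e' in t:
                            t = t.replace("e", PinyinToneMark[tone][2])
                        elif t.endswith("ui"):
                            t = t.replace("i", PinyinToneMark[tone][3])
                        elif t.endswith("iu"):
                            t = t.replace("u", PinyinToneMark[tone][4])
                        else:
                            t += "!"  # flag a vowel group that could not be marked
            r += t
            t = ""
    r += t
    return r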

This handles ü, u:, and v, all of which I've encountered. Minor modifications will be needed for Python 2 compatibility.
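
The three spellings of ü mentioned above can also be normalized in a separate step before any tone placement. The following is only a minimal sketch, not part of the code above; the helper name `normalize_umlaut` is hypothetical, and it assumes the input is pinyin, where "v" never occurs as a letter in its own right.

```python
def normalize_umlaut(syllable):
    """Rewrite the ASCII spellings 'u:' and 'v' as 'ü' (and 'U:'/'V' as 'Ü')."""
    return (syllable.replace('u:', 'ü').replace('U:', 'Ü')
                    .replace('v', 'ü').replace('V', 'Ü'))

print(normalize_umlaut('lu:4 nv3 lü4'))
```

After this step a converter only has to deal with a single spelling of ü.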

Greg Hewgill answered Sep 21 '22 01:09

The cjklib library does cover your needs:

Either use the Python shell:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert('Bei3jing1', 'Pinyin', 'Pinyin', sourceOptions={'toneMarkType': 'numbers'})

Or just the command line:

$ cjknife -m Bei3jing1

Disclaimer: I developed that library.

cburgmer answered Sep 22 '22 01:09


I wrote another Python function that does this, which is case insensitive and preserves spaces, punctuation and other text (unless there are false positives, of course):

# -*- coding: utf-8 -*-
# Python 2 syntax (ur'' literal, print statement)
import re

pinyinToneMarks = {
    u'a': u'āáǎà', u'e': u'ēéěè', u'i': u'īíǐì',
    u'o': u'ōóǒò', u'u': u'ūúǔù', u'ü': u'ǖǘǚǜ',
    u'A': u'ĀÁǍÀ', u'E': u'ĒÉĚÈ', u'I': u'ĪÍǏÌ',
    u'O': u'ŌÓǑÒ', u'U': u'ŪÚǓÙ', u'Ü': u'ǕǗǙǛ'
}

def convertPinyinCallback(m):
    tone=int(m.group(3))%5
    r=m.group(1).replace(u'v', u'ü').replace(u'V', u'Ü')
    # for multiple vowels, use first one if it is a/e/o, otherwise use second one
    pos=0
    if len(r)>1 and not r[0] in 'aeoAEO':
        pos=1
    if tone != 0:
        r=r[0:pos]+pinyinToneMarks[r[pos]][tone-1]+r[pos+1:]
    return r+m.group(2)

def convertPinyin(s):
    return re.sub(ur'([aeiouüvÜ]{1,3})(n?g?r?)([012345])', convertPinyinCallback, s, flags=re.IGNORECASE)

print convertPinyin(u'Ni3 hao3 ma0?')
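
For Python 3, the same regex approach might look roughly like the sketch below. It is an adaptation, not the code above: the names `TONE_MARKS`, `SYLLABLE_RE`, `_mark_syllable` and `convert_pinyin` are chosen here for illustration, and it relies on Python 3 `str` patterns matching ü and Ü case-insensitively under `re.IGNORECASE`.

```python
import re

# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
    'a': 'āáǎà', 'e': 'ēéěè', 'i': 'īíǐì',
    'o': 'ōóǒò', 'u': 'ūúǔù', 'ü': 'ǖǘǚǜ',
    'A': 'ĀÁǍÀ', 'E': 'ĒÉĚÈ', 'I': 'ĪÍǏÌ',
    'O': 'ŌÓǑÒ', 'U': 'ŪÚǓÙ', 'Ü': 'ǕǗǙǛ',
}

# Vowel cluster, optional final consonants, then the tone digit.
SYLLABLE_RE = re.compile(r'([aeiouüv]{1,3})(n?g?r?)([012345])', re.IGNORECASE)

def _mark_syllable(m):
    tone = int(m.group(3)) % 5  # 0 and 5 both mean neutral tone
    vowels = m.group(1).replace('v', 'ü').replace('V', 'Ü')
    # Mark the first vowel if it is a/e/o, otherwise the second one.
    pos = 1 if len(vowels) > 1 and vowels[0] not in 'aeoAEO' else 0
    if tone != 0:
        vowels = vowels[:pos] + TONE_MARKS[vowels[pos]][tone - 1] + vowels[pos + 1:]
    return vowels + m.group(2)

def convert_pinyin(s):
    """Replace numbered syllables in s with tone-marked ones; other text is kept."""
    return SYLLABLE_RE.sub(_mark_syllable, s)

print(convert_pinyin('Ni3 hao3 ma0?'))
```

Called once per line of `sys.stdin`, `convert_pinyin` could also serve as a command-line filter for large inputs.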
dani_l answered Sep 19 '22 01:09
