Are there any scripts, libraries, or programs using Python, or BASH tools (e.g. awk, perl, sed) which can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diàn nǎo)?
I have found the following examples, but they require PHP or C#:
I have also found various online tools, but they cannot handle a large number of conversions.
I've got some Python 3 code that does this, and it's small enough to just put directly in the answer here.
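
    import re

    # Index 0 lists the plain vowels; indexes 1-4 list the same vowels carrying
    # that tone's mark. Both "v" and "ü" map to the marked ü.
    PinyinToneMark = {
        0: "aoeiuv\u00fc",
        1: "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6",
        2: "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8",
        3: "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da",
        4: "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc",
    }

    def decode_pinyin(s):
        s = s.lower()
        r = ""  # converted output so far
        t = ""  # syllable currently being collected
        for c in s:
            if c >= 'a' and c <= 'z':
                t += c
            elif c == ':':
                # "u:" is an alternative spelling of ü
                assert t[-1] == 'u'
                t = t[:-1] + "\u00fc"
            else:
                if c >= '0' and c <= '5':
                    tone = int(c) % 5  # 0 and 5 both mean the neutral tone
                    if tone != 0:
                        m = re.search("[aoeiuv\u00fc]+", t)
                        if m is None:
                            t += c
                        elif len(m.group(0)) == 1:
                            # single vowel: it takes the mark
                            t = t[:m.start(0)] + PinyinToneMark[tone][PinyinToneMark[0].index(m.group(0))] + t[m.end(0):]
                        else:
                            # several vowels: mark a, else o, else e,
                            # else the last vowel of ui / iu
                            if 'a' in t:
                                t = t.replace("a", PinyinToneMark[tone][0])
                            elif 'o' in t:
                                t = t.replace("o", PinyinToneMark[tone][1])
                            elif 'e' in t:
                                t = t.replace("e", PinyinToneMark[tone][2])
                            elif t.endswith("ui"):
                                t = t.replace("i", PinyinToneMark[tone][3])
                            elif t.endswith("iu"):
                                t = t.replace("u", PinyinToneMark[tone][4])
                            else:
                                t += "!"
                # Any other character ends the syllable. Note that the character
                # itself (space, punctuation) is not copied to the output.
                r += t
                t = ""
        r += t
        return r
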
This handles ü, u:, and v, all of which I've encountered. Minor modifications will be needed for Python 2 compatibility.
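The multi-vowel branches in decode_pinyin appear to follow the usual placement rule for pinyin tone marks: a or e takes the mark if present, otherwise the o of ou, otherwise the last vowel. As a minimal sketch of just that rule, here is a variant that composes the marked letter with `unicodedata.normalize` instead of a lookup table. It assumes a single lowercase syllable with ü already normalized; the names `COMBINING` and `mark_syllable` are illustrative and not part of the answer's code.

```python
import unicodedata

# Combining diacritics for tones 1-4: macron, acute, caron, grave.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"


def mark_syllable(syllable, tone):
    """Place a tone mark on one lowercase pinyin syllable.

    Assumes 'v' and 'u:' have already been rewritten as 'ü'.
    Tones outside 1-4 (e.g. 0 or 5, neutral) leave the syllable unchanged.
    """
    if tone not in COMBINING:
        return syllable
    if "a" in syllable:
        pos = syllable.index("a")
    elif "e" in syllable:
        pos = syllable.index("e")
    elif "ou" in syllable:
        pos = syllable.index("o")
    else:
        # Otherwise the last vowel carries the mark (e.g. ui, iu, uo).
        positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
        if not positions:
            return syllable
        pos = positions[-1]
    # NFC composes base letter + combining mark into one precomposed character.
    marked = unicodedata.normalize("NFC", syllable[pos] + COMBINING[tone])
    return syllable[:pos] + marked + syllable[pos + 1:]


print(mark_syllable("dian", 4), mark_syllable("nao", 3))
```

Splitting the input into syllables and reading the trailing digit is left to the caller in this sketch.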
The cjklib library does cover your needs:
Either use the Python shell:

    >>> from cjklib.reading import ReadingFactory
    >>> f = ReadingFactory()
    >>> print f.convert('Bei3jing1', 'Pinyin', 'Pinyin', sourceOptions={'toneMarkType': 'numbers'})
    Běijīng

Or just the command line:

    $ cjknife -m Bei3jing1
    Běijīng

Disclaimer: I developed that library.
I wrote another Python function that does this, which is case insensitive and preserves spaces, punctuation and other text (unless there are false positives, of course):

    # -*- coding: utf-8 -*-
    # Python 2 syntax (ur'' literal, print statement)
    import re

    # For each vowel, the four marked forms in tone order 1-4
    pinyinToneMarks = {
        u'a': u'āáǎà', u'e': u'ēéěè', u'i': u'īíǐì',
        u'o': u'ōóǒò', u'u': u'ūúǔù', u'ü': u'ǖǘǚǜ',
        u'A': u'ĀÁǍÀ', u'E': u'ĒÉĚÈ', u'I': u'ĪÍǏÌ',
        u'O': u'ŌÓǑÒ', u'U': u'ŪÚǓÙ', u'Ü': u'ǕǗǙǛ'
    }

    def convertPinyinCallback(m):
        # group 1: vowel cluster, group 2: optional n/g/r ending, group 3: tone digit
        tone = int(m.group(3)) % 5
        r = m.group(1).replace(u'v', u'ü').replace(u'V', u'Ü')
        # for multiple vowels, use first one if it is a/e/o, otherwise use second one
        pos = 0
        if len(r) > 1 and not r[0] in 'aeoAEO':
            pos = 1
        if tone != 0:
            r = r[0:pos] + pinyinToneMarks[r[pos]][tone-1] + r[pos+1:]
        return r + m.group(2)

    def convertPinyin(s):
        return re.sub(ur'([aeiouüvÜ]{1,3})(n?g?r?)([012345])', convertPinyinCallback, s, flags=re.IGNORECASE)

    print convertPinyin(u'Ni3 hao3 ma0?')

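The function above uses Python 2 syntax (the `ur''` literal and the `print` statement). A rough Python 3 port might look like the sketch below. The regular expression and the placement logic are carried over unchanged; the names `TONE_MARKS`, `PATTERN`, `_callback` and `convert_pinyin` are renamed for illustration, and nothing here has been checked against every valid pinyin syllable.

```python
import re

# For each vowel, the four marked forms in tone order 1-4.
TONE_MARKS = {
    'a': 'āáǎà', 'e': 'ēéěè', 'i': 'īíǐì',
    'o': 'ōóǒò', 'u': 'ūúǔù', 'ü': 'ǖǘǚǜ',
    'A': 'ĀÁǍÀ', 'E': 'ĒÉĚÈ', 'I': 'ĪÍǏÌ',
    'O': 'ŌÓǑÒ', 'U': 'ŪÚǓÙ', 'Ü': 'ǕǗǙǛ',
}

# Vowel cluster, optional n/g/r ending, tone digit (same pattern as the original).
PATTERN = re.compile(r'([aeiouüv]{1,3})(n?g?r?)([012345])', re.IGNORECASE)


def _callback(m):
    tone = int(m.group(3)) % 5  # 0 and 5 both mean neutral tone
    vowels = m.group(1).replace('v', 'ü').replace('V', 'Ü')
    # Mark the first vowel if it is a/e/o, otherwise the second one.
    pos = 0
    if len(vowels) > 1 and vowels[0] not in 'aeoAEO':
        pos = 1
    if tone != 0:
        marked = TONE_MARKS[vowels[pos]][tone - 1]
        vowels = vowels[:pos] + marked + vowels[pos + 1:]
    return vowels + m.group(2)


def convert_pinyin(text):
    """Replace numbered pinyin in text with tone-marked pinyin."""
    return PATTERN.sub(_callback, text)


print(convert_pinyin('Ni3 hao3 ma0?'))  # → Nǐ hǎo ma?
```

For a large batch, one option is to apply `convert_pinyin` to a file line by line, since text outside the matched syllables passes through untouched.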