Convert numbered pinyin to pinyin with tone marks

Tags:

python

bash

cjk

Are there any scripts, libraries, or programs using Python, or BASH tools (e.g. awk, perl, sed) which can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diàn nǎo)?

I have found the following examples, but they require PHP or C#:

[PHP] Convert numbered to accentuated Pinyin?
[C#] Any libraries to convert number Pinyin to Pinyin with tone markings?

I have also found various online tools, but they cannot handle a large number of conversions.

389

asked Nov 20 '11 08:11

Village

3 Answers

I've got some Python 3 code that does this, and it's small enough to just put directly in the answer here.

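import re

# Row 0 lists the plain vowels; rows 1-4 hold the same vowels with tone marks 1-4.
# "v" and "ü" (the last two columns) both map to the marked forms of ü.
PinyinToneMark = {
    0: "aoeiuv\u00fc",
    1: "\u0101\u014d\u0113\u012b\u016b\u01d6\u01d6",
    2: "\u00e1\u00f3\u00e9\u00ed\u00fa\u01d8\u01d8",
    3: "\u01ce\u01d2\u011b\u01d0\u01d4\u01da\u01da",
    4: "\u00e0\u00f2\u00e8\u00ec\u00f9\u01dc\u01dc",
}

def decode_pinyin(s):
    s = s.lower()
    r = ""  # converted output
    t = ""  # syllable being collected
    for c in s:
        if ('a' <= c <= 'z') or c == "\u00fc":
            t += c
        elif c == ':':
            # "u:" is an alternative spelling of ü
            assert t[-1] == 'u'
            t = t[:-1] + "\u00fc"
        else:
            if '0' <= c <= '5':
                tone = int(c) % 5  # 5 and 0 both mean the neutral tone
                if tone != 0:
                    m = re.search("[aoeiuv\u00fc]+", t)
                    if m is None:
                        # no vowel to mark: keep the digit as written
                        t += c
                    elif len(m.group(0)) == 1:
                        t = t[:m.start(0)] + PinyinToneMark[tone][PinyinToneMark[0].index(m.group(0))] + t[m.end(0):]
                    else:
                        # several vowels: a, then o, then e takes the mark;
                        # otherwise the last vowel of "ui" / "iu"
                        if 'a' in t:
                            t = t.replace("a", PinyinToneMark[tone][0])
                        elif 'o' in t:
                            t = t.replace("o", PinyinToneMark[tone][1])
                        elif 'e' in t:
                            t = t.replace("e", PinyinToneMark[tone][2])
                        elif t.endswith("ui"):
                            t = t.replace("i", PinyinToneMark[tone][3])
                        elif t.endswith("iu"):
                            t = t.replace("u", PinyinToneMark[tone][4])
                        else:
                            t += "!"
            else:
                # keep spaces, punctuation and other characters
                t += c
            r += t
            t = ""
    r += t
    return r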

This handles ü, u:, and v, all of which I've encountered. Minor modifications will be needed for Python 2 compatibility.
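
The escape sequences in the tone table are hard to verify by eye. As a sanity check, each row can be rebuilt from a base vowel plus a Unicode combining diacritic and then composed with NFC normalization. This is only a sketch for checking the table, not part of the answer's code; `COMBINING` and `tone_row` are names made up here.

```python
import unicodedata

# Combining diacritics for tones 1-4: macron, acute, caron, grave.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}

def tone_row(tone, bases="aoeiu\u00fc\u00fc"):
    """Build the marked vowels for one tone by composing base + diacritic.

    The default bases follow the column order a, o, e, i, u, v, ü,
    with both "v" and "ü" represented by ü.
    """
    return "".join(
        unicodedata.normalize("NFC", base + COMBINING[tone]) for base in bases
    )

for tone in range(1, 5):
    print(tone, tone_row(tone))
```

Each printed row should match the corresponding entry of `PinyinToneMark`.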

146

answered Sep 21 '22 01:09

Greg Hewgill

The cjklib library does cover your needs:

Either use the Python shell:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert('Bei3jing1', 'Pinyin', 'Pinyin', sourceOptions={'toneMarkType': 'numbers'})
Běijīng

Or just the command line:

$ cjknife -m Bei3jing1
Běijīng

Disclaimer: I developed that library.

answered Sep 22 '22 01:09

cburgmer

I wrote another Python function that does this, which is case insensitive and preserves spaces, punctuation and other text (unless there are false positives, of course):

# -*- coding: utf-8 -*-
import re

pinyinToneMarks = {
    u'a': u'āáǎà', u'e': u'ēéěè', u'i': u'īíǐì',
    u'o': u'ōóǒò', u'u': u'ūúǔù', u'ü': u'ǖǘǚǜ',
    u'A': u'ĀÁǍÀ', u'E': u'ĒÉĚÈ', u'I': u'ĪÍǏÌ',
    u'O': u'ŌÓǑÒ', u'U': u'ŪÚǓÙ', u'Ü': u'ǕǗǙǛ'
}

def convertPinyinCallback(m):
    tone=int(m.group(3))%5
    r=m.group(1).replace(u'v', u'ü').replace(u'V', u'Ü')
    # for multiple vowels, use the first one if it is a/e/o, otherwise use the second one
    pos=0
    if len(r)>1 and not r[0] in 'aeoAEO':
        pos=1
    if tone != 0:
        r=r[0:pos]+pinyinToneMarks[r[pos]][tone-1]+r[pos+1:]
    return r+m.group(2)

def convertPinyin(s):
    return re.sub(ur'([aeiouüvÜ]{1,3})(n?g?r?)([012345])', convertPinyinCallback, s, flags=re.IGNORECASE)

print convertPinyin(u'Ni3 hao3 ma0?')
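
The code above uses Python 2 syntax (`ur''` literals and the `print` statement). A possible Python 3 adaptation of the same regex approach is sketched below; the names `TONE_MARKS`, `SYLLABLE`, `_mark` and `convert_pinyin` are chosen here for illustration, and the sketch assumes that `re.IGNORECASE` on a `str` pattern also covers `ü`/`Ü`, which is the default Unicode behaviour in Python 3.

```python
import re

# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
    "A": "ĀÁǍÀ", "E": "ĒÉĚÈ", "I": "ĪÍǏÌ",
    "O": "ŌÓǑÒ", "U": "ŪÚǓÙ", "Ü": "ǕǗǙǛ",
}

# Vowel cluster, optional n/ng/r ending, then a tone digit 0-5.
SYLLABLE = re.compile(r"([aeiouüv]{1,3})(n?g?r?)([0-5])", re.IGNORECASE)

def _mark(match):
    """Replace one matched syllable tail with its tone-marked form."""
    vowels = match.group(1).replace("v", "ü").replace("V", "Ü")
    tone = int(match.group(3)) % 5  # 0 and 5 both mean the neutral tone
    # Mark the first vowel if it is a/e/o, otherwise the second one.
    pos = 1 if len(vowels) > 1 and vowels[0] not in "aeoAEO" else 0
    if tone:
        marked = TONE_MARKS[vowels[pos]][tone - 1]
        vowels = vowels[:pos] + marked + vowels[pos + 1:]
    return vowels + match.group(2)

def convert_pinyin(text):
    """Convert numbered pinyin inside text, leaving everything else alone."""
    return SYLLABLE.sub(_mark, text)

print(convert_pinyin("Ni3 hao3 ma0?"))  # Nǐ hǎo ma?
```

As with the original, any vowel cluster followed by a digit is treated as pinyin, so unrelated text such as "a1" would also be converted.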

answered Sep 19 '22 01:09

dani_l
