Is there a way to programmatically combine Korean unicode into one?

Question

Using a Korean Input Method Editor (IME), it's possible to type 버리 + 어 and it will automatically become 버려.

Is there a way to programmatically do that in Python?

>>> x, y = '버리', '어'
>>> z = '버려'
>>> ord(z[-1])
47140
>>> ord(x[-1]), ord(y)
(47532, 50612)

Is there a way to compute that 47532 + 50612 -> 47140?

Here's some more examples:

가보 + 아 -> 가봐

끝나 + ㄹ -> 끝날

Sangbok Lee · Accepted Answer

I'm a Korean. First, if you type 버리 + 어, it becomes 버리어 not 버려. 버려 is an abbreviation of 버리어 and it's not automatically generated. Also 가보아 cannot becomes 가봐 automatically during typing by the same reason.

Second, by contrast, 끝나 + ㄹ becomes 끝날 because 나 has no jongseong(종성). Note that one character of Hangul is made of choseong(초성), jungseong(중성), and jongseong. choseong and jongseong are a consonant, jungseong is a vowel. See more at Wikipedia. So only when there's no jongseong during typing (like 끝나), there's a chance that it can have jongseong(ㄹ).

If you want to make 버리 + 어 to 버려, you should implement some Korean language grammar like, especially for this case, abbreviation of jungseong. For example ㅣ + ㅓ = ㅕ, ㅗ + ㅏ = ㅘ as you provided. 한글 맞춤법 chapter 4. section 5 (I can't find English pages right now) defines abbreviation like this. It's possible, but not so easy job especially for non-Koreans.

Next, if what you want is just to make 끝나 + ㄹ to 끝날, it can be a relatively easy job since there're libraries which can handle composition and decomposition of choseong, jungseong, jongseong. In case of Python, I found hgtk. You can try like this (nonpractical code):

# hgtk methods take one character at a time
cjj1 = hgtk.letter.decompose('나')  # ('ㄴ', 'ㅏ', '')
cjj2 = hgtk.letter.decompose('ㄹ')  # ('ㄹ', '', '')
if cjj1[2]) == '' and cjj2[1]) == '':
    cjj = (cjj1[0], cjj1[1], cjj2[0])
    cjj2 = None

Still, without proper knowledge of Hangul, it will be very hard to get it done.

stovfl · Answer

You could use your own Translation table.
The drawback is you have to input all pairs manual or you have a file to get it from.
For instance:

# Sample Korean chars to map
k = [[('버리', '어'), ('버려')], [('가보', '아'), ('가봐')], [('끝나', 'ㄹ'), ('끝날')]]

class Korean(object):
    def __init__(self):
        self.map = {}

        for m in k:
            key = m[0][0] + m[0][1]
            self.map[hash(key)] = m[1]

    def __getitem__(self, item):
        return self.map[hash(item)]

    def translate(self, s):
        return [ self.map[hash(token)] for token in s]

if __name__ == '__main__':
    k_map = Korean()
    k_chars = [ m[0][0] + m[0][1] for m in  k]

    print('Input: %s' % k_chars)
    print('Output: %s' % k_map.translate(k_chars))

    one_char_3 = k[0][0][0] + k[0][0][1]
    print('%s = %s' % (one_char_3, k_map[ one_char_3 ]) )

Input: ['버리어', '가보아', '끝나ㄹ']
Output: ['버려', '가봐', '끝날']
버리어 = 버려

Tested with Python:3.4.2

Is there a way to programmatically combine Korean unicode into one?

Tags:

python

unicode

nlp

ime

korean-nlp

alvas

2 Answers

Sangbok Lee

stovfl

Recent Activity

Donate For Us

Is there a way to programmatically combine Korean unicode into one?

Tags:

python

unicode

nlp

ime

korean-nlp

alvas

2 Answers

Sangbok Lee

stovfl

Related questions

Recent Activity

Donate For Us