Splitting Thai text by characters

I want to split Thai text into characters, not at word boundaries (that part is solvable).

Example:

#!/usr/bin/env python3  
text = 'เมื่อแรกเริ่ม'  
for char in text:  
    print(char)  

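This prints each of the 13 code points on its own line, so the vowel signs and tone marks come out detached from the consonants they belong to.
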
Which obviously is not the desired output. Any ideas?

A portable representation of text is:

text = u'\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
josifoski asked May 07 '15 14:05



2 Answers

tl;dr: Use the \X regular expression (supported by the third-party regex module) to extract user-perceived characters:

>>> import regex # $ pip install regex
>>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']

While I do not know Thai, I know a little French.

Consider the letter è. Let s and s2 both display as è in the Python shell:

>>> s
'è'
>>> s2
'è'

Same letter? To a French speaker visually, oui. To a computer, no:

>>> s==s2
False

You can create the same letter either using the actual code point for è or by taking the letter e and adding a combining code point that adds that accent character. They have different encodings:

>>> s.encode('utf-8')
b'\xc3\xa8'
>>> s2.encode('utf-8')
b'e\xcc\x80'

And different lengths:

>>> len(s)
1
>>> len(s2)
2
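The answer does not show how s and s2 were created. A minimal sketch, assuming s is the precomposed code point U+00E8 and s2 is the letter e followed by the combining grave accent U+0300, reproduces the results above:

```python
# Two spellings of the same visible letter (the names s and s2 follow the answer;
# the exact code points are an assumption consistent with the encodings shown).
s = '\u00e8'     # LATIN SMALL LETTER E WITH GRAVE: one code point
s2 = 'e\u0300'   # 'e' + COMBINING GRAVE ACCENT: two code points

print(s == s2)              # False
print(s.encode('utf-8'))    # b'\xc3\xa8'
print(s2.encode('utf-8'))   # b'e\xcc\x80'
print(len(s), len(s2))      # 1 2
```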

But visually both encodings result in the 'letter' è. This is called a grapheme (a grapheme cluster in Unicode terminology), or what the end user considers one character.

You can demonstrate the same looping behavior you are seeing:

>>> [c for c in s]
['è']
>>> [c for c in s2]
['e', '̀']

Your string has several combining characters in it. Hence a Thai string that looks like 9 graphemes to your eyes is a 13 code point string to Python.

The solution in French is to normalize the string based on Unicode equivalence:

>>> from unicodedata import normalize
>>> normalize('NFC', s2) == s
True
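Normalization runs in both directions, and it can only shorten a string when a precomposed code point exists. The sketch below, which assumes the same two spellings of è and the Thai text from the question, shows the round trip for è and what NFC does to the Thai string:

```python
from unicodedata import normalize

composed = '\u00e8'      # precomposed è (assumed spelling of s)
decomposed = 'e\u0300'   # e + combining grave accent (assumed spelling of s2)

# NFC composes, NFD decomposes
print(normalize('NFC', decomposed) == composed)    # True
print(normalize('NFD', composed) == decomposed)    # True

# The Thai text from the question, written with escapes for portability
text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
nfc_text = normalize('NFC', text)
print(len(text), len(nfc_text))                    # 13 13
```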

That does not work for many non-Latin scripts though, since normalization can only compose a sequence when a precomposed code point exists. An easy way to deal with Unicode strings in which several code points may compose a single grapheme is a regex engine that handles this correctly by supporting \X. Unfortunately, Python's included re module doesn't yet.

The proposed replacement, regex, does support \X though:

>>> import regex
>>> text = 'เมื่อแรกเริ่ม'
>>> regex.findall(r'\X', text)
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
>>> len(_)
9
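Once the string is split into clusters, user-level operations are safer on the list than on the raw string. As an illustrative sketch, the clusters below are copied from the regex output above as a literal (so the snippet runs without regex installed), and reversing is compared both ways:

```python
# Thai text from the question and the clusters reported by regex.findall(r'\X', text)
text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
clusters = ['\u0e40', '\u0e21\u0e37\u0e48', '\u0e2d', '\u0e41', '\u0e23',
            '\u0e01', '\u0e40', '\u0e23\u0e34\u0e48', '\u0e21']

# The clusters are just a partition of the original string
print(''.join(clusters) == text)      # True
print(len(text), len(clusters))       # 13 9

# Reversing code points moves the marks away from their consonants;
# reversing clusters keeps each consonant and its marks together.
by_code_point = text[::-1]
by_cluster = ''.join(reversed(clusters))
print(by_code_point == by_cluster)    # False
```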
dawg answered Oct 07 '22 16:10



I cannot exactly reproduce the problem, but here is a slightly modified version of your script, with its output in IDLE 3.4 on a 64-bit Windows 7 system:

>>> import unicodedata
>>> for char in text:
    print(char, hex(ord(char)), unicodedata.name(char),'-',
          unicodedata.category(char), '-', unicodedata.combining(char), '-',
          unicodedata.east_asian_width(char))


เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
ื 0xe37 THAI CHARACTER SARA UEE - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
อ 0xe2d THAI CHARACTER O ANG - Lo - 0 - N
แ 0xe41 THAI CHARACTER SARA AE - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ก 0xe01 THAI CHARACTER KO KAI - Lo - 0 - N
เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ิ 0xe34 THAI CHARACTER SARA I - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
>>>

I really do not know what those characters are - my Thai is very poor :-) - but the output shows that:

  • text is recognized as Thai
  • the output is consistent with len(text) (13)
  • the combining marks stand out: their category is Mn instead of Lo, and the tone mark MAI EK has a non-zero combining class (107)
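The same observations can be checked in a few lines. This is a small sketch using only unicodedata and collections.Counter; the variable names are illustrative:

```python
import unicodedata
from collections import Counter

text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

# Tally the general category of every code point
counts = Counter(unicodedata.category(ch) for ch in text)
print(counts['Lo'], counts['Mn'])   # 9 4

# List the nonspacing marks, i.e. the code points that attach to a base character
marks = [hex(ord(ch)) for ch in text if unicodedata.category(ch) == 'Mn']
print(marks)                        # ['0xe37', '0xe48', '0xe34', '0xe48']
```

Nine base letters plus four marks account for the 13 code points, which matches the 9 characters a reader sees.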

If that is the expected output, it shows that your problem is not in Python but in the console where you display it. Try redirecting the output to a file, then open the file in a Unicode-aware editor that supports Thai characters.
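Instead of shell redirection, the file can also be written from the script with an explicit encoding, which takes the console out of the picture. A minimal sketch; the file name thai_output.txt and its location in the temp directory are placeholders:

```python
import os
import tempfile

text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'

# Hypothetical output path; any writable location works
path = os.path.join(tempfile.gettempdir(), 'thai_output.txt')

# Write one code point per line, forcing UTF-8 regardless of the console encoding
with open(path, 'w', encoding='utf-8') as f:
    for char in text:
        print(char, file=f)

# Read the file back to confirm nothing was lost on the way
with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(len(lines))   # 13
```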

If the expected output is only 9 characters, that is, if you do not want combining marks separated from their base characters, and provided there are no other composition rules to consider, you could use something like:

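import unicodedata

def Thaidump(t):
    old = None
    for i in t:
        if unicodedata.category(i) == 'Mn':
            # nonspacing mark: attach it to the pending base character
            # (a mark with nothing before it starts its own cluster)
            old = i if old is None else old + i
        else:
            # new base character: print the previous cluster first
            if old is not None:
                print(old)
            old = i
    if old is not None:  # nothing to print for an empty string
        print(old)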

That way:

>>> Thaidump(text)
เ
มื่
อ
แ
ร
ก
เ
ริ่
ม
>>> 
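If the clusters are needed as values rather than printed, the same idea can be written as a generator. This is a sketch under the same assumption as above, namely that category Mn is the only thing that needs attaching; thai_clusters is a hypothetical name:

```python
import unicodedata

def thai_clusters(t):
    """Yield each base character together with the nonspacing marks after it."""
    cluster = ''
    for ch in t:
        if unicodedata.category(ch) == 'Mn' and cluster:
            # nonspacing mark: extend the current cluster
            cluster += ch
        else:
            # new base character (or a mark with no base): emit the previous cluster
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

text = '\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
result = list(thai_clusters(text))
print(len(result))   # 9
```

This does not implement the full Unicode grapheme cluster rules, so \X from the regex module remains the more general option.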
Serge Ballesta answered Oct 07 '22 17:10
