I do not mean splitting on word boundaries; that is solvable.
Example:
#!/usr/bin/env python3
text = 'เมื่อแรกเริ่ม'
for char in text:
    print(char)
This produces:
เ
ม
อ
แ
ร
ก
เ
ร
ม
Which obviously is not the desired output. Any ideas?
A portable representation of text is:
text = u'\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01\u0e40\u0e23\u0e34\u0e48\u0e21'
tl;dr: Use the \X regular expression to extract user-perceived characters:
>>> import regex # $ pip install regex
>>> regex.findall(u'\\X', u'เมื่อแรกเริ่ม')
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
While I do not know Thai, I know a little French.
Consider the letter è. Let s and s2 equal è in the Python shell:
>>> s
'è'
>>> s2
'è'
Same letter? To a French speaker visually, oui. To a computer, no:
>>> s==s2
False
You can create the same letter either by using the actual code point for è or by taking the letter e and adding a combining code point that adds the accent. They have different encodings:
>>> s.encode('utf-8')
b'\xc3\xa8'
>>> s2.encode('utf-8')
b'e\xcc\x80'
And different lengths:
>>> len(s)
1
>>> len(s2)
2
But visually both encodings result in the 'letter' è. This is called a grapheme, or what the end user considers one character.
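Because the two forms look identical on screen, it can help to build them explicitly from escape sequences. The following is a minimal sketch, assuming s is the precomposed code point U+00E8 and s2 is the letter e followed by U+0300, which is consistent with the UTF-8 bytes shown above:

```python
import unicodedata

# Precomposed form: a single code point, U+00E8.
s = '\u00e8'
# Decomposed form: 'e' followed by U+0300 COMBINING GRAVE ACCENT.
s2 = 'e\u0300'

# Print each string's length and the Unicode names of its code points.
for label, value in (('s', s), ('s2', s2)):
    names = [unicodedata.name(c) for c in value]
    print(label, len(value), names)
```

The names printed for s2 make the second, combining code point visible even though the rendered glyph is the same.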
You can demonstrate the same looping behavior you are seeing:
>>> [c for c in s]
['è']
>>> [c for c in s2]
['e', '̀']
Your string has several combining characters in it. Hence a Thai string that looks like 9 graphemes to your eyes is a 13-character string to Python.
The solution in French is to normalize the string based on Unicode equivalence:
>>> from unicodedata import normalize
>>> normalize('NFC', s2) == s
True
That does not work for many non-Latin languages though. An easy way to deal with Unicode strings in which multiple code points may compose a single grapheme is a regex engine that handles this correctly by supporting \X. Unfortunately, Python's included re module doesn't yet.
The proposed replacement, regex, does support \X though:
>>> import regex
>>> text = 'เมื่อแรกเริ่ม'
>>> regex.findall(r'\X', text)
['เ', 'มื่', 'อ', 'แ', 'ร', 'ก', 'เ', 'ริ่', 'ม']
>>> len(_)
9
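To see why normalization alone does not help with this particular string, here is a small sketch using only the standard library. It assumes the same Thai text as in the question, written with escapes for portability. Unicode has no precomposed code points for these consonant-plus-mark combinations, so NFC leaves the string unchanged:

```python
from unicodedata import normalize

# The Thai string from the question, written with escapes.
text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')

composed = normalize('NFC', text)

# NFC cannot merge the marks into their base letters here,
# so the length stays at 13 code points.
print(len(text), len(composed))
print(composed == text)
```

That is why a grapheme-aware splitter such as \X is needed here rather than normalization.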
I cannot exactly reproduce it, but here is a slightly modified version of your script, with the output on IDLE 3.4 on a Windows 7 64-bit system:
>>> import unicodedata
>>> for char in text:
        print(char, hex(ord(char)), unicodedata.name(char), '-',
              unicodedata.category(char), '-', unicodedata.combining(char), '-',
              unicodedata.east_asian_width(char))
เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
ื 0xe37 THAI CHARACTER SARA UEE - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
อ 0xe2d THAI CHARACTER O ANG - Lo - 0 - N
แ 0xe41 THAI CHARACTER SARA AE - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ก 0xe01 THAI CHARACTER KO KAI - Lo - 0 - N
เ 0xe40 THAI CHARACTER SARA E - Lo - 0 - N
ร 0xe23 THAI CHARACTER RO RUA - Lo - 0 - N
ิ 0xe34 THAI CHARACTER SARA I - Mn - 0 - N
่ 0xe48 THAI CHARACTER MAI EK - Mn - 107 - N
ม 0xe21 THAI CHARACTER MO MA - Lo - 0 - N
>>>
I really do not know what those characters can be (my Thai is very poor :-)), but it shows that len(text) is 13.
If that is the expected output, it proves that your problem is not in Python but rather in the console where you display it. You should try redirecting the output to a file and then opening the file in a Unicode editor that supports Thai characters.
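One way to take the console out of the picture is to write the characters to a UTF-8 file from Python itself. This is only a sketch: the file name thai_output.txt and its location in the temporary directory are arbitrary choices for illustration, not something from the question.

```python
import os
import tempfile

# The Thai string from the question, written with escapes.
text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')

# Hypothetical output path, chosen only for this example.
path = os.path.join(tempfile.gettempdir(), 'thai_output.txt')

# Write one code point per line, forcing UTF-8 regardless of console settings.
with open(path, 'w', encoding='utf-8') as f:
    for char in text:
        f.write(char + '\n')

# Read the file back to confirm that nothing was lost on the way.
with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()

print(len(lines))
```

Opening the resulting file in an editor with Thai support shows whether the 13 code points themselves are intact.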
If the expected output is only 9 characters, that is, if you do not want to decompose composed characters, and provided there are no other composition rules to consider, you could use something like:
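import unicodedata

def Thaidump(t):
    old = None  # cluster being built: a base character plus its marks
    for i in t:
        if unicodedata.category(i) == 'Mn':
            # Nonspacing mark: attach it to the current base character.
            if old is not None:
                old = old + i
        else:
            # New base character: print the finished cluster first.
            if old is not None:
                print(old)
            old = i
    if old is not None:
        print(old)  # print the last cluster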
That way:
>>> Thaidump(text)
เ
มื่
อ
แ
ร
ก
เ
ริ่
ม
>>>
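If you need the pieces as data rather than printed output, the same idea can be written as a function that returns a list. This is a sketch, with the hypothetical name split_clusters, that groups only on the 'Mn' (nonspacing mark) category. It is an approximation of grapheme clusters that happens to suffice for this string. It does not implement the full Unicode segmentation rules that \X follows.

```python
import unicodedata

def split_clusters(text):
    """Group each nonspacing mark (category 'Mn') with the character before it."""
    clusters = []
    for char in text:
        if unicodedata.category(char) == 'Mn' and clusters:
            # Attach the mark to the most recent cluster.
            clusters[-1] += char
        else:
            # Start a new cluster (also covers a mark with nothing before it).
            clusters.append(char)
    return clusters

# The Thai string from the question, written with escapes.
text = ('\u0e40\u0e21\u0e37\u0e48\u0e2d\u0e41\u0e23\u0e01'
        '\u0e40\u0e23\u0e34\u0e48\u0e21')
result = split_clusters(text)
print(len(result))
```

Unlike the printing version, a leading mark is kept as its own cluster instead of being dropped, and an empty string yields an empty list.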