 

Correct length of a string of non-English characters in Python3

I am given a string of Hebrew characters (and some Arabic ones; I know neither language) in a file:

צוֹר‎

When I load this string from the file in Python 3:

fin = open("filename")
x = next(fin).strip()

The length of x appears to be 5

>>> len(x)
5

Its UTF-8 encoding is:

>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'

However, in a browser the text clearly shows only 3 Hebrew letters.

How do I get the length properly? And why does this happen?

I am aware that Python 3 strings are Unicode by default, so I did not expect such an issue.

Yo Hsiao asked Dec 24 '22 11:12


2 Answers

The reason is that the text contains the character \u200e, the left-to-right mark (LTR mark): an invisible formatting character often used in mixed-language text to mark the boundary between left-to-right and right-to-left runs. The text also includes a vowel point, \u05b9 (the little dot above the second letter, which shows how to pronounce it). It is a combining mark, but len() still counts it as a separate code point.
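To see exactly which code points make up the string, you can print each one with its Unicode name. This is a minimal sketch using the standard-library unicodedata module; the helper name describe is made up for illustration, and the string is written with escapes so the invisible characters are explicit.

```python
import unicodedata

# The string from the question, written with escapes so that the
# invisible characters are explicit (three letters, one vowel point,
# and a trailing left-to-right mark).
x = "\u05e6\u05d5\u05b9\u05e8\u200e"

def describe(s):
    """Return (code point, Unicode name) pairs for each character in s.

    `describe` is a hypothetical helper name, not part of any library.
    """
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in s]

for code_point, name in describe(x):
    print(code_point, name)
```

Each printed line corresponds to one unit counted by len(), which is why the result is 5 rather than 3.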

If you replace the LTR mark with the empty string, for instance, you get a length of 4:

>>> x = 'צוֹר\u200e'  # the pasted text ends with an invisible LTR mark, written here as an escape
>>> x
'צוֹר\u200e'
>>> print(len(x))
5

>>> print(len(x.replace('\u200e', '')))
4

If you want to count only word characters (letters, digits, and the underscore) and whitespace, you could use re.sub to strip everything else:

>>> import re
>>> print(len(re.sub(r'[^\w\s]', '', x)))
3
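
One caveat with the regex approach, shown as a small sketch: for str patterns in Python 3, \w matches Unicode letters, digits, and the underscore, so punctuation is removed along with the marks while digits and underscores are kept. The sample strings and the helper name word_space_length are assumptions for illustration.

```python
import re

def word_space_length(s):
    """Length of s after dropping every non-word, non-whitespace character.

    Hypothetical helper wrapping the re.sub call from the answer above.
    """
    return len(re.sub(r"[^\w\s]", "", s))

hebrew = "\u05e6\u05d5\u05b9\u05e8\u200e"  # the string from the question
print(word_space_length(hebrew))     # vowel point and LTR mark are dropped
print(word_space_length("a, b_1!"))  # "," and "!" dropped; "_", "1", and the space kept
```

If punctuation should count toward the length, filtering by Unicode category is probably a better fit than this regex.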

lemonhead answered Dec 28 '22 07:12


Unicode characters have different categories. In your case:

>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']
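  • Lo: Letter, other (a letter that is not uppercase, lowercase, or such). These are the "real" characters.
  • Mn: Mark, nonspacing. This is a combining mark (here the Hebrew vowel point holam, U+05B9) that attaches to the previous character.
  • Cf: Other, format. This is an invisible formatting character; here it is the left-to-right mark (U+200E), which influences text direction rather than displaying anything.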
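
Building on these categories, one way to approximate the length a reader sees is to count only the code points that are neither combining marks nor format characters. This is a rough sketch, not a full grapheme-cluster implementation (the Unicode text segmentation rules cover more cases); the function name visible_length and the choice of skipped categories are assumptions.

```python
import unicodedata

# Categories skipped when counting: nonspacing marks (Mn), enclosing
# marks (Me), and format characters (Cf). This set is an assumption
# chosen for the string in the question, not a general rule.
SKIPPED_CATEGORIES = {"Mn", "Me", "Cf"}

def visible_length(s):
    """Count code points whose category is not in SKIPPED_CATEGORIES.

    `visible_length` is a hypothetical helper name.
    """
    return sum(1 for c in s if unicodedata.category(c) not in SKIPPED_CATEGORIES)

s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
print(len(s), visible_length(s))
```

Unlike a regex on \w, this keeps punctuation and symbols in the count and only drops characters that do not occupy a position of their own.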

Michael Butscher answered Dec 28 '22 05:12