I am given a string of Hebrew characters (and some Arabic ones; I know neither language) in a file:
צוֹר
When I load this string from the file in Python 3:
fin = open("filename")
x = next(fin).strip()
the length of x appears to be 5:
>>> len(x)
5
Its UTF-8 encoding is:
>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'
However, in browsers it is clear that this Hebrew text is only 3 characters long.
How do I get the length properly? And why does this happen?
I am aware that Python 3 strings are Unicode by default, so I did not expect such an issue.
The reason is that the text contains the control character \u200e, an invisible left-to-right mark (often used in mixed-script text to demarcate the boundary between left-to-right and right-to-left runs). Additionally, it includes a vowel "character" (the little dot above the second letter that indicates how to pronounce it), which is a combining mark rather than a letter of its own.
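To see exactly what is in the string, here is a minimal sketch (decoding the bytes shown in the question) that prints the Unicode name of each code point; it should print something like the output below, making the invisible mark and the combining vowel point visible:
>>> import unicodedata
>>> x = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> for c in x:
...     # print the code point in U+XXXX form plus its official Unicode name
...     print(f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}")
...
U+05E6 HEBREW LETTER TSADI
U+05D5 HEBREW LETTER VAV
U+05B9 HEBREW POINT HOLAM
U+05E8 HEBREW LETTER RESH
U+200E LEFT-TO-RIGHT MARK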
If you replace the LTR mark with the empty string, for instance, you get a length of 4:
>>> x = 'צוֹר'  # the pasted literal ends with an invisible LRM
>>> x
'צוֹר\u200e'  # note the control character escape sequence in the repr
>>> print(len(x))
5
>>> print(len(x.replace('\u200e', '')))
4
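If you do not want to hard-code \u200e, a more general sketch (my own variation, not part of the original answer) is to drop every code point whose Unicode category is Cf (format), which covers the LRM and similar invisible direction marks:
>>> import unicodedata
>>> stripped = ''.join(c for c in x if unicodedata.category(c) != 'Cf')
>>> len(stripped)
4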
If you want the length of alphabetic and space characters only, you could use re.sub to strip out all non-space, non-word characters:
>>> import re
>>> print(len(re.sub(r'[^\w\s]', '', x)))
3
Unicode characters have different categories. In your case:
>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']
Lo: Letter, other (not uppercase, lowercase or such). These are "real" characters.
Mn: Mark, nonspacing. This is some type of accent character combined with the previous character.
Cf: Control, format. Here it switches back to left-to-right writing direction.
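Building on those categories, a minimal sketch (my own addition) that counts only the code points whose category starts with L (the Letter categories) reproduces the length of 3 that you see in the browser:
>>> sum(1 for c in s if unicodedata.category(c).startswith('L'))
3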