I am given a string of Hebrew characters (and some Arabic ones; I know neither language) in a file:
צוֹר
When I load this string from the file in Python 3:
fin = open("filename")
x = next(fin).strip()
the length of x appears to be 5:
>>> len(x)
5
Its UTF-8 encoding is:
>>> x.encode("utf-8")
b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'
However, in browsers it is clear that this Hebrew text is only 3 characters long.
How do I get the length properly? And why does this happen?
I am aware that Python 3 strings are Unicode by default, so I did not expect such an issue.
The reason is that the text contains the control character \u200e, an invisible left-to-right mark (often used in mixed-script text to demarcate the boundary between left-to-right and right-to-left runs). Additionally, it includes a vowel "character" (the little dot above the second letter that indicates how to pronounce it), which is a combining mark rather than a letter of its own.
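To see exactly what is in the string, here is a minimal sketch (decoding the bytes shown in the question) that prints the Unicode name of each code point; it should print something like the output below, making the invisible mark and the combining vowel point visible:
>>> import unicodedata
>>> x = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> for c in x:
...     # print the code point in U+XXXX form plus its official Unicode name
...     print(f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}")
...
U+05E6 HEBREW LETTER TSADI
U+05D5 HEBREW LETTER VAV
U+05B9 HEBREW POINT HOLAM
U+05E8 HEBREW LETTER RESH
U+200E LEFT-TO-RIGHT MARK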
If you replace the LTR mark with the empty string, for instance, you get a length of 4:
>>> x = 'צוֹר'  # the pasted literal ends with an invisible LRM
>>> x
'צוֹר\u200e'  # note the control character escape sequence in the repr
>>> print(len(x))
5
>>> print(len(x.replace('\u200e', '')))
4
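If you do not want to hard-code \u200e, a more general sketch (my own variation, not part of the original answer) is to drop every code point whose Unicode category is Cf (format), which covers the LRM and similar invisible direction marks:
>>> import unicodedata
>>> stripped = ''.join(c for c in x if unicodedata.category(c) != 'Cf')
>>> len(stripped)
4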
If you want the length of alphabetic and space characters only, you could use re.sub to strip out all non-space, non-word characters:
>>> import re
>>> print(len(re.sub(r'[^\w\s]', '', x)))
3
Unicode characters have different categories. In your case:
>>> import unicodedata
>>> s = b'\xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e'.decode("utf-8")
>>> list(unicodedata.category(c) for c in s)
['Lo', 'Lo', 'Mn', 'Lo', 'Cf']
Lo: Letter, other (not uppercase, lowercase or such). These are "real" characters.
Mn: Mark, nonspacing. This is some type of accent character combined with the previous character.
Cf: Control, format. Here it switches back to left-to-right writing direction.
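Building on those categories, a minimal sketch (my own addition) that counts only the code points whose category starts with L (the Letter categories) reproduces the length of 3 that you see in the browser:
>>> sum(1 for c in s if unicodedata.category(c).startswith('L'))
3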