
Python 3 len() function for Unicode characters

We tend to believe Python 3 got everything right on Unicode, so I was surprised when I ran into this situation.

>>> amma = "அம்மா"
>>> amma
'அம்மா'
>>> len(amma)
5

The Tamil string "அம்மா" has 3 letters, so a return value of 5 from len("அம்மா") is hard to accept.

How do other Dravidian or Brahmic scripts deal with this issue to get the right string length?

Edit #1: Considering @joey's comment, this question can be rephrased as below.

How to calculate the grapheme length in Python?

We know Swift and Perl 6 do this by default:

  2> let amma = "அம்மா".characters.count
amma: Distance = 3
nehem asked Oct 21 '25 07:10



1 Answer

It may have 3 letters, but it has 5 characters (Unicode code points), and code points are what len() counts:

$ charinfo 'அம்மா'
U+0B85 TAMIL LETTER A [Lo]
U+0BAE TAMIL LETTER MA [Lo]
U+0BCD TAMIL SIGN VIRAMA [Mn]
U+0BAE TAMIL LETTER MA [Lo]
U+0BBE TAMIL VOWEL SIGN AA [Mc]
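
The same breakdown can be reproduced from Python with the standard unicodedata module, in case charinfo is not available. This is only a rough sketch; the helper name describe is made up for illustration and is not part of any library.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME [Category]' line per code point in text.

    'describe' is a hypothetical helper name, not a library function.
    """
    return [
        "U+{:04X} {} [{}]".format(
            ord(ch),
            unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed code points
            unicodedata.category(ch),
        )
        for ch in text
    ]

amma = "அம்மா"
for line in describe(amma):
    print(line)
```

For this string it should print five lines, one per code point, matching the charinfo listing.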

If you want the number of letters rather than code points, count only the characters whose Unicode general category is Letter (L*). Combining marks such as the virama [Mn] and the vowel sign [Mc] are then left out.
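
A minimal sketch of that letter count, using only the standard library. The function name count_letters is an assumption chosen for illustration.

```python
import unicodedata

def count_letters(text):
    """Count code points whose general category is a Letter (Lu, Ll, Lt, Lm, Lo).

    'count_letters' is a hypothetical helper name, not a library function.
    """
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))

amma = "அம்மா"
print(len(amma))            # 5 code points
print(count_letters(amma))  # 3 letters; the virama and the vowel sign are skipped
```

Note that this counts letters, not graphemes: digits, punctuation and emoji would all count as zero. For user-perceived characters in general, grapheme cluster segmentation as described in Unicode Standard Annex #29 is probably the better fit; the third-party regex module matches such clusters with \X.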

Ignacio Vazquez-Abrams answered Oct 22 '25 23:10
