
Character count of Unicode string [duplicate]

Tags:

python

unicode

How would I get the character count of the string below in Python?

s = 'הוא אוסף אתכם מחר בשלוש וחצי.'

Character count: 29
Byte length: 52

len(s) = 52
? = 29
David542 asked Jan 26 '15 22:01



1 Answer

Decode your byte string (using whatever encoding it's in, probably UTF-8); the len of the resulting Unicode string is what you're after.

In fact, best practice is to decode inputs as soon as possible, deal only with actual text in your code (i.e., unicode in Python 2; in Python 3 that's simply what ordinary strings are), and, if need be, encode again only at the point of output.

Byte strings should be handled in your program only if it's specifically about byte strings (e.g., controlling or monitoring some hardware device). Far more programs are about text, and so, except where indispensable at some I/O boundaries, they should deal exclusively with text strings (spelled unicode in Python 2 :-).
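To illustrate the decode-early, encode-late advice, here is a minimal sketch. It assumes Python 3 and UTF-8 input, and the helper name first_chars is made up for this example; it is not part of any library.

```python
def first_chars(raw: bytes, n: int, encoding: str = "utf-8") -> bytes:
    """Return the first n characters of raw, re-encoded as bytes.

    Hypothetical helper: decode at the input boundary, work on text,
    encode again only at the output boundary.
    """
    text = raw.decode(encoding)   # bytes -> text as soon as possible
    head = text[:n]               # slicing text counts characters, not bytes
    return head.encode(encoding)  # text -> bytes only when outputting


raw = "הוא אוסף".encode("utf-8")
out = first_chars(raw, 3)
print(out.decode("utf-8"))
```

Slicing the bytes directly (raw[:3]) would cut the second Hebrew letter in half, since each of these letters takes two bytes in UTF-8; slicing the decoded text avoids that.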

But if you nevertheless want to keep s as a byte string,

len(s.decode('utf-8'))

(or whatever other encoding you're using to represent text as byte strings) should still do what you request.
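As a quick check against the sentence from the question, here is a small sketch. It assumes Python 3 (where the literal is already text, so it is encoded first to stand in for the Python 2 byte string) and that the encoding is UTF-8.

```python
s = 'הוא אוסף אתכם מחר בשלוש וחצי.'

# Stand-in for the Python 2 byte string: the UTF-8 encoding of the text.
raw = s.encode('utf-8')

byte_count = len(raw)                  # length in bytes
char_count = len(raw.decode('utf-8'))  # length in characters

print(byte_count, char_count)  # prints: 52 29
```

The 23 Hebrew letters take two bytes each in UTF-8, while the five spaces and the period take one each, which accounts for 52 bytes against 29 characters.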

Alex Martelli answered Sep 24 '22 18:09
