Python length of unicode string confusion

Tags:

python

unicode

There's been quite some help around this already, but I am still confused.

I have a unicode string like this:

title = u'😉test'
title_length = len(title) #5

But! I need len(title) to be 6. The clients expect it to be 6 because they seem to count in a different way than I do on the backend.

As a workaround I have written this little helper, but I am sure it can be improved (with enough knowledge about encodings) or perhaps it's even wrong.

title_length = len(title) + repr(title).count('\\U') #6

1. Is there a better way of getting the length to be 6? :-)

I assume that I (Python) am counting the number of Unicode characters, which is 5. Are the clients counting the number of bytes?

2. Would my logic break for other unicode characters that need 4 bytes for example?

Running Python 2.7 ucs4.


asked Jun 11 '15 08:06

kev

1 Answer

You have 5 codepoints. One of those codepoints is outside of the Basic Multilingual Plane, which means the UTF-16 encoding of that codepoint has to use two code units (a surrogate pair) for the character.
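As a rough illustration of why that one codepoint takes two code units, here is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic. The helper name surrogate_pair is hypothetical and only serves this example:

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain after removing the BMP range
    high = 0xD800 + (offset >> 10)         # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go into the low surrogate
    return high, low


# U+1F609 is the winking-face emoji used in the question's title.
high, low = surrogate_pair(0x1F609)
print(hex(high), hex(low))  # 0xd83d 0xde09
```

The four letters of "test" are all inside the Basic Multilingual Plane and take one code unit each, so the total is 2 + 4 = 6.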

In other words, the client is relying on an implementation detail, and is doing something wrong. They should be counting codepoints, not code units. There are several platforms where this happens quite regularly; Python 2 UCS-2 builds are one such, but Java developers often forget about the difference, as do Windows APIs.

You can encode your text to UTF-16 and divide the number of bytes by two (each UTF-16 code unit is 2 bytes). Pick the utf-16-le or utf-16-be variant to not include a BOM in the length:

title = u'😉test'
len_in_codeunits = len(title.encode('utf-16-le')) // 2
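The same calculation can be wrapped in a small helper and compared with the other ways of counting. This is only a sketch: the function names are made up for the example, and the per-codepoint variant assumes that iterating the string yields whole codepoints (Python 3, or a UCS-4 build of Python 2).

```python
def utf16_length(text):
    """Number of UTF-16 code units needed for text (no BOM counted)."""
    return len(text.encode('utf-16-le')) // 2


def utf16_length_by_codepoint(text):
    """Same count without encoding: 2 units above U+FFFF, otherwise 1.

    Assumes each element of text is a whole codepoint.
    """
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


title = u'\U0001F609test'                 # same string as above, written with an escape
print(utf16_length(title))                # 6 UTF-16 code units
print(utf16_length_by_codepoint(title))   # 6
print(len(title.encode('utf-8')))         # 8 bytes in UTF-8, so the clients are not counting bytes
```

Every codepoint that needs 4 bytes in UTF-8 lies above U+FFFF and therefore needs exactly two UTF-16 code units, so the helper gives consistent results for any such character.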

If you are using Python 2 (and judging by the u prefix to the string you may well be), take into account that there are two different flavours of Python 2, depending on how it was built. Depending on a build-time configuration switch you'll either have a UCS-2 or a UCS-4 build; the former uses surrogates internally too, and your title value length will be 6 there as well. See the related question "Python returns length of 2 for single Unicode character string".
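If it is unclear which kind of build is running, sys.maxunicode tells the two apart. The snippet below is a sketch with an illustrative helper name (is_narrow_build); the UTF-16 based count comes out the same on either build, while len() differs.

```python
import sys


def is_narrow_build():
    """True on a UCS-2 (narrow) Python 2 build, False on UCS-4 builds and Python 3.3+."""
    # sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF otherwise.
    return sys.maxunicode == 0xFFFF


title = u'\U0001F609test'
native_length = len(title)                          # 6 on a narrow build, 5 otherwise
utf16_units = len(title.encode('utf-16-le')) // 2   # 6 regardless of the build

print(is_narrow_build(), native_length, utf16_units)
```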


answered Sep 19 '22 11:09

Martijn Pieters
