How do you get the display width of combined Unicode characters in Python 3?

Tags: python, python-3.x, unicode

In Python 3, Unicode strings are supposed to kindly give you the number of Unicode characters, but I can't figure out how to get the final display width of a string given that some characters combine.

Genesis 1:1 -- בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ

>>> len('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
60

But the string is only 37 characters wide. Normalization doesn't solve the problem because the vowels (dots underneath the larger characters) are distinct characters.

>>> import unicodedata
>>> len(unicodedata.normalize('NFC', 'בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ'))
60

As a side note: the textwrap module is totally broken in this regard, aggressively wrapping where it shouldn't. str.format seems similarly broken.

Similar question that was marked as a duplicate: Display width of unicode strings in Python
The question it was marked as a duplicate of only addresses normalization: Normalizing Unicode


asked Jun 17 '15 03:06

Conley Owens

1 Answer

The problem is the combining characters: Python counts each one as a separate character when computing __len__, but they merge with the preceding base character into a single printed character.

To find out whether a character is a combining character, we can use the unicodedata module:

unicodedata.combining(unichr)

Returns the canonical combining class assigned to the Unicode character unichr as integer. Returns 0 if no combining class is defined.
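As a quick illustration (a sketch, using code points I picked for the example rather than anything specific to the question), a base letter reports class 0 while a combining mark reports a non-zero class:

```python
import unicodedata

# Base characters have canonical combining class 0.
base_latin = unicodedata.combining('e')
base_hebrew = unicodedata.combining('\u05d1')   # HEBREW LETTER BET

# Combining marks have a non-zero class.
acute = unicodedata.combining('\u0301')         # COMBINING ACUTE ACCENT
sheva = unicodedata.combining('\u05b0')         # HEBREW POINT SHEVA

print(base_latin, base_hebrew)   # 0 0
print(acute != 0, sheva != 0)    # True True
```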

A naive solution is to just strip out any characters with a non-zero combining class. This leaves characters that stand on their own, and should give us a string with a 1-to-1 mapping between visible and underlying characters. (I am a Unicode novice, and it’s probably more complicated than that. There are subtleties with combining characters and grapheme extenders which I don’t really understand, but don’t seem to matter for this particular string.)

So I came up with this function:

import unicodedata

def visible_length(unistr):
    """Return the number of printed characters in a Unicode string."""
    # Count only characters whose canonical combining class is 0 (non-combining).
    return sum(1 for char in unistr if unicodedata.combining(char) == 0)

which returns the correct length for your string:

>>> visible_length('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
37
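The same count can probably be used to work around the padding problem with str.format mentioned in the question. This is only a sketch: pad_visible is a hypothetical helper name, and it assumes every non-combining character takes exactly one column.

```python
import unicodedata

def visible_length(unistr):
    """Count characters whose canonical combining class is 0."""
    return sum(1 for char in unistr if unicodedata.combining(char) == 0)

def pad_visible(unistr, width):
    """Left-justify unistr in a field of `width` visible characters (hypothetical helper)."""
    padding = max(0, width - visible_length(unistr))
    return unistr + ' ' * padding

word = '\u05d1\u05b0\u05bc'        # bet + sheva + dagesh: one visible character
print(len(word))                   # 3
print(len(word.ljust(4)))          # 4: str.ljust pads by code points, so only one space is added
print(visible_length(pad_visible(word, 4)))  # 4
```

With str.ljust the field ends up two columns short on screen, while pad_visible adds three spaces so the visible width comes out to four.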

This is probably not a complete solution for all Unicode strings, but depending on what subset of Unicode you’re working with, this may be enough for your needs.
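One place where the combining-class test falls short: some characters that occupy no column still have combining class 0 (the zero-width joiner U+200D is one), and East Asian wide characters usually occupy two columns in a terminal. A rough sketch that checks the general category and the East Asian width instead might look like the following. The name display_width and the choice of categories are my assumptions, it ignores control characters such as tabs and newlines, and it is not a substitute for a full width table.

```python
import unicodedata

def display_width(text):
    """Rough estimate of the terminal columns needed for text (a sketch only)."""
    width = 0
    for char in text:
        # Assumption: nonspacing marks (Mn), enclosing marks (Me) and
        # format characters (Cf) take no column of their own.
        if unicodedata.category(char) in ('Mn', 'Me', 'Cf'):
            continue
        # East Asian Wide (W) and Fullwidth (F) characters usually take two columns.
        if unicodedata.east_asian_width(char) in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width

print(display_width('e\u0301'))             # 1: letter plus combining acute accent
print(display_width('\u05d1\u05b0\u05bc'))  # 1: bet + sheva + dagesh
print(display_width('a\u200db'))            # 2: the zero-width joiner is skipped
print(display_width('\u65e5\u672c'))        # 4: two wide CJK characters
```

How a given terminal or font actually renders these characters can still differ, so treat the result as an estimate.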


answered Sep 22 '22 09:09

alexwlchan
