
How do you get the display width of combined Unicode characters in Python 3?

In Python 3, Unicode strings are supposed to kindly give you the number of Unicode characters, but I can't figure out how to get the final display width of a string given that some characters combine.

Genesis 1:1 -- בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ

>>> len('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
60

But the string is only 37 characters wide. Normalization doesn't solve the problem because the vowels (dots underneath the larger characters) are distinct characters.

>>> import unicodedata
>>> len(unicodedata.normalize('NFC', 'בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ'))
60
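
A small sketch of why NFC does not help here (the sample strings are illustrative and not from the original post): NFC can only compose a sequence when Unicode defines a precomposed character for it. That holds for a Latin e plus an acute accent, but not for a Hebrew letter plus a vowel point.

```python
import unicodedata

latin = 'e\u0301'        # 'e' + COMBINING ACUTE ACCENT
hebrew = '\u05d1\u05b0'  # HEBREW LETTER BET + HEBREW POINT SHEVA

latin_nfc = unicodedata.normalize('NFC', latin)
hebrew_nfc = unicodedata.normalize('NFC', hebrew)

print(len(latin), len(latin_nfc))    # the Latin pair composes to one code point
print(len(hebrew), len(hebrew_nfc))  # the Hebrew pair has no precomposed form
```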

As a side note: the textwrap module is totally broken in this regard, aggressively wrapping where it shouldn't. str.format seems similarly broken.

  • Similar question that was marked as a duplicate: Display width of unicode strings in Python
  • The question it was marked as a duplicate of only addresses normalization: Normalizing Unicode
asked Jun 17 '15 03:06 by Conley Owens


1 Answer

The problem is the combining characters: Python counts each one as a separate code point when computing __len__, but on screen they merge with the preceding base character into a single printed character.

To find out whether a character is a combining character, we can use the unicodedata module:

unicodedata.combining(unichr)

Returns the canonical combining class assigned to the Unicode character unichr as integer. Returns 0 if no combining class is defined.
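
As a quick illustration (a sketch; the code points are my own picks, not taken from the question), a base letter reports class 0 while the points attached to it report non-zero classes:

```python
import unicodedata

# Bet (U+05D1) followed by two Hebrew points: sheva (U+05B0) and dagesh (U+05BC).
syllable = '\u05d1\u05b0\u05bc'

# Pair each code point's Unicode name with its canonical combining class.
classes = [(unicodedata.name(char), unicodedata.combining(char)) for char in syllable]

for name, combining_class in classes:
    print(f'{combining_class:3d}  {name}')
```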

A naive solution is to just strip out any characters with a non-zero combining class. This leaves characters that stand on their own, and should give us a string with a 1-to-1 mapping between visible and underlying characters. (I am a Unicode novice, and it’s probably more complicated than that. There are subtleties with combining characters and grapheme extenders which I don’t really understand, but don’t seem to matter for this particular string.)

So I came up with this function:

import unicodedata

def visible_length(unistr):
    '''Return the number of printed characters in a Unicode string.'''
    # Count only code points that are not combining marks (combining class 0).
    return sum(1 for char in unistr if unicodedata.combining(char) == 0)

which returns the correct length for your string:

>>> visible_length('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
37
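
The same count can drive padding, which is where str.format and friends go wrong, since they pad by len(). A minimal sketch, assuming a hypothetical helper named ljust_visible (it is not part of the standard library) and a short Latin sample string instead of the Hebrew verse:

```python
import unicodedata

def visible_length(unistr):
    '''Return the number of code points that are not combining marks.'''
    return sum(1 for char in unistr if unicodedata.combining(char) == 0)

def ljust_visible(unistr, width, fillchar=' '):
    '''Left-justify by visible length rather than by len() (hypothetical helper).'''
    padding = max(0, width - visible_length(unistr))
    return unistr + fillchar * padding

word = 'cafe\u0301'  # 'cafe' + COMBINING ACUTE ACCENT: 5 code points, 4 visible

builtin_padded = word.ljust(6)          # pads by len(), so it comes up one short
visible_padded = ljust_visible(word, 6)  # pads by visible length

print(repr(builtin_padded), visible_length(builtin_padded))
print(repr(visible_padded), visible_length(visible_padded))
```

This only fixes the padding arithmetic. It does not touch textwrap, which would need the same treatment wherever it measures a line.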

This is probably not a complete solution for all Unicode strings, but depending on what subset of Unicode you’re working with, this may be enough for your needs.
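
One way to go a little further, sketched here as a heuristic rather than a complete answer: skip code points by general category (non-spacing marks, enclosing marks, format controls) and count East Asian wide and fullwidth characters as two columns. The name display_width is made up for this sketch, and real terminals may disagree with its result, for example on emoji sequences.

```python
import unicodedata

def display_width(text):
    '''Approximate the number of terminal columns text occupies (heuristic).'''
    width = 0
    for char in text:
        # Mn = non-spacing mark, Me = enclosing mark, Cf = format control.
        # These normally take no column of their own.
        if unicodedata.category(char) in ('Mn', 'Me', 'Cf'):
            continue
        # East Asian Wide ('W') and Fullwidth ('F') characters take two columns.
        if unicodedata.east_asian_width(char) in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width

print(display_width('\u05d1\u05b0\u05bc'))  # bet + sheva + dagesh
print(display_width('\u4e2d\u6587'))        # two CJK ideographs
```

Unlike the combining-class check, this also catches marks whose combining class is 0, such as the variation selector U+FE0F.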

answered Sep 22 '22 09:09 by alexwlchan