In Python 3, len() on a Unicode string gives you the number of Unicode code points, but I can't figure out how to get the final display width of a string, given that some characters combine.
Genesis 1:1 -- בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ
>>> len('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
60
But the string is only 37 characters wide. Normalization doesn't solve the problem because the vowels (dots underneath the larger characters) are distinct characters.
>>> len(unicodedata.normalize('NFC', 'בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ'))
60
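To illustrate why NFC does not help here, a small sketch comparing a Latin pair that has a precomposed form with a Hebrew letter plus vowel point that does not. The variable names are just for illustration.

```python
import unicodedata

# Latin: 'e' + COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
# so NFC merges the two code points into one.
latin = 'e\u0301'
latin_nfc = unicodedata.normalize('NFC', latin)

# Hebrew: BET + POINT SHEVA has no precomposed code point,
# so NFC leaves both code points in place.
hebrew = '\u05d1\u05b0'
hebrew_nfc = unicodedata.normalize('NFC', hebrew)

print(len(latin), len(latin_nfc))    # 2 1
print(len(hebrew), len(hebrew_nfc))  # 2 2
```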
As a side note: the textwrap module is totally broken in this regard, aggressively wrapping where it shouldn't. str.format seems similarly broken.
In Python, the built-in functions chr() and ord() are used to convert between Unicode code points and characters. A character can also be represented by writing a hexadecimal Unicode code point with \x, \u, or \U in a string literal.
In Python 2, if encoding and/or errors are given, unicode() will decode the object, which can be either an 8-bit string or a character buffer, using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
unicodedata.normalize(form, unistr): Return the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'. The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence.
The problem is the combining characters, which Python counts as distinct when computing __len__, but which merge into a single printed character.
To find out whether a character is a combining character, we can use the unicodedata module:
unicodedata.combining(unichr)
Returns the canonical combining class assigned to the Unicode character unichr as integer. Returns 0 if no combining class is defined.
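As a quick way to see what unicodedata.combining reports, here is a small sketch that lists the combining class of each code point in a string. The helper name combining_report is made up for this example.

```python
import unicodedata

def combining_report(text):
    """Return (code point, name, combining class) for each character in text.

    combining_report is a hypothetical helper, not part of unicodedata.
    """
    return [
        (f'U+{ord(ch):04X}', unicodedata.name(ch, '<unnamed>'), unicodedata.combining(ch))
        for ch in text
    ]

# BET followed by two points (sheva and dagesh): one base letter, two combining marks.
for row in combining_report('\u05d1\u05b0\u05bc'):
    print(row)
```

The base letter should report class 0, while the two points report non-zero classes.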
A naive solution is to just strip out any characters with a non-zero combining class. This leaves characters that stand on their own, and should give us a string with a 1-to-1 mapping between visible and underlying characters. (I am a Unicode novice, and it’s probably more complicated than that. There are subtleties with combining characters and grapheme extenders which I don’t really understand, but don’t seem to matter for this particular string.)
So I came up with this function:
import unicodedata

def visible_length(unistr):
    '''Returns the number of printed characters in a Unicode string.'''
    # Count only characters whose canonical combining class is 0 (base characters).
    return len([char for char in unistr if unicodedata.combining(char) == 0])
which returns the correct length for your string:
>>> visible_length('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
37
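Since str.format pads by code point count, one possible workaround is to pad by hand using the visible count instead. This is only a sketch: pad_visible is a hypothetical helper, it pads in logical order, and it ignores how a terminal handles right-to-left text.

```python
import unicodedata

def visible_length(text):
    """Count characters whose canonical combining class is 0."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

def pad_visible(text, width):
    """Left-justify text to `width` visible characters (hypothetical helper)."""
    return text + ' ' * max(0, width - visible_length(text))

word = '\u05d1\u05b0\u05bc'  # one visible character, three code points

by_format = '{:<5}'.format(word)   # pads based on len(word) == 3
by_visible = pad_visible(word, 5)  # pads based on visible_length(word) == 1

print(visible_length(by_format))   # 3
print(visible_length(by_visible))  # 5
```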
This is probably not a complete solution for all Unicode strings, but depending on what subset of Unicode you’re working with, this may be enough for your needs.
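If you need to cover a bit more of Unicode, one possible extension is to skip characters by general category rather than combining class, and to count East Asian wide characters as two columns. This is a rough sketch under those assumptions, not a full implementation of terminal width rules: display_width is a made-up name, and it does not handle emoji sequences or other multi-code-point grapheme clusters.

```python
import unicodedata

def display_width(text):
    """Estimate the column width of text (hypothetical helper, an approximation)."""
    width = 0
    for ch in text:
        # Assumption: nonspacing marks (Mn), enclosing marks (Me) and
        # format characters (Cf) take up no column of their own.
        if unicodedata.category(ch) in ('Mn', 'Me', 'Cf'):
            continue
        # Assumption: East Asian Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return width

print(display_width('\u05d1\u05b0\u05bc'))  # 1
print(display_width('\u65e5\u672c'))        # 4
```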