How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()
?
Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.
>>> for title in d.keys():
>>> print("{:<20} | {}".format(title, d[title]))
zootehni- | zooteh.
zootekni- | zootek.
zoothèque | zooth.
zooveterinar- | zoovet.
zoovetinstitut- | zoovetinst.
母 | 母母
>>> s = 'è'
>>> len(s)
2
>>> [ord(c) for c in s]
[101, 768]
>>> unicodedata.name(s[1])
'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
1
As can be seen, str.format()
simply takes the number of code-points in the string (len(s)
) as its width, leading to skewed columns in the output. Searching through the unicodedata
module, I have not found anything suggesting a solution.
Unicode normalization can fix the problem for è, but not for Asian characters, which often have larger display width. Similarly, zero-width unicode characters exist (e.g. zero-width space for allowing line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".
Edit: Added info about normalization.
Edit 2: In my original dataset also have some European combining characters that don't result in a single code-point even after normalization:
zwemwater | zwemw.
zwia̢z- | zw.
>>> s3 = 'a\u0322' # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
2
To include Unicode characters in your Python source code, you can use Unicode escape characters in the form \u0123 in your string. In Python 2. x, you also need to prefix the string literal with 'u'.
Remarks. If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
In python, text could be presented using unicode string or bytes. Unicode is a standard for encoding character. Unicode string is a python data structure that can store zero or more unicode characters.
The length of a single Unicode character as a Python str will always be 1, no matter how many bytes it occupies. The length of the same character encoded to bytes will be anywhere between 1 and 4.
Normal strings in Python are stored internally as 8-bit ASCII, while Unicode strings are stored as 16-bit Unicode. This allows for a more varied set of characters, including special characters from most languages in the world. I'll restrict my treatment of Unicode strings to the following −
This is a subtle but important feature of len(): The length of a single Unicode character as a Python str will always be 1, no matter how many bytes it occupies. The length of the same character encoded to bytes will be anywhere between 1 and 4.
The rules for translating a Unicode string into a sequence of bytes are called a character encoding, or just an encoding. The first encoding you might think of is using 32-bit integers as the code unit, and then using the CPU’s representation of 32-bit integers. In this representation, the string “Python” might look like this:
You have several options:
Some consoles support escape sequences for pixel-exact positioning of the cursor. Might cause some overprinting, though.
Historical note: This approach was used in the Amiga terminal to display images in a console window by printing a line of text and then advancing the cursor down by one pixel. The leftover pixels of the text line slowly built an image.
Create a table in your code which contains the real (pixel) widths of all Unicode characters in the font that is used in the console / terminal window. Use a UI framework and a small Python script to generate this table.
Then add code which calculates the real width of the text using this table. The result might not be a multiple of the character width in the console, though. Together with pixel-exact cursor movement, this might solve your issue.
Note: You'll have to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load a UI framework without opening a window and use the graphics primitives to calculate the string widths.
Use the tab character (\t
) to indent. But that will only help if your shell actually uses the real text width to place the cursor. Many terminals will simply count characters.
Create a HTML file with a table and look at it in a browser.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With