Display width of unicode strings in Python [duplicate]

Tags: python, string, width, unicode, python-unicode

How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()?

Motivating example: Printing a table of strings to the console. Some of the strings contain non-ASCII characters.

>>> for title in d.keys():
...     print("{:<20} | {}".format(title, d[title]))

    zootehni-           | zooteh.
    zootekni-           | zootek.
    zoothèque          | zooth.
    zooveterinar-       | zoovet.
    zoovetinstitut-     | zoovetinst.
    母                   | 母母

>>> import unicodedata
>>> s = 'e\u0300'   # 'è' written as 'e' + combining grave accent, as in zoothèque
>>> len(s)
    2
>>> [ord(c) for c in s]
    [101, 768]
>>> unicodedata.name(s[1])
    'COMBINING GRAVE ACCENT'
>>> s2 = '母'
>>> len(s2)
    1

As can be seen, str.format() simply takes the number of code-points in the string (len(s)) as its width, leading to skewed columns in the output. Searching through the unicodedata module, I have not found anything suggesting a solution.
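To make the mismatch concrete, here is a minimal sketch showing that the padding is derived from `len()`. The helper name `pad_cell` and the sample strings are illustrative and not part of the original data; the column counts in the comments are what a typical fixed-width terminal would render, which is an assumption.

```python
def pad_cell(text, width=20):
    """Left-align text the same way the loop above does."""
    return "{:<{w}}".format(text, w=width)

plain = "zootekni-"            # 9 code points, 9 columns
combining = "zoothe\u0300que"  # 10 code points, typically rendered in 9 columns
wide = "\u6bcd"                # 1 code point, typically rendered in 2 columns

for s in (plain, combining, wide):
    cell = pad_cell(s)
    # Every padded cell is exactly 20 code points long,
    # regardless of how many columns it occupies on screen.
    print(len(s), len(cell), repr(cell))
```

Because each cell always contains 20 code points, a string with a combining mark ends up one column short and a wide character one column long.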

Unicode normalization can fix the problem for è, but not for Asian characters, which often have a larger display width. Similarly, zero-width Unicode characters exist (e.g. the zero-width space, which allows line breaks within words). You can't work around these issues with normalization, so please do not suggest "normalize your strings".
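A short sketch of where normalization helps and where it does not. The helper `nfc_len` is a hypothetical name introduced only for this illustration.

```python
import unicodedata

def nfc_len(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

# 'e' + COMBINING GRAVE ACCENT composes into the single code point U+00E8.
print(nfc_len("e\u0300"))

# U+6BCD stays a single code point, although it is usually two columns wide.
print(nfc_len("\u6bcd"))

# ZERO WIDTH SPACE (U+200B) survives normalization and still counts as one code point.
print(nfc_len("a\u200bb"))
```

Only the first case ends up with a length that matches the number of columns shown on screen.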

Edit: Added info about normalization.

Edit 2: My original dataset also has some European combining characters that don't result in a single code point even after normalization:

    zwemwater     | zwemw.
    zwia̢z-       | zw.

>>> s3 = 'a\u0322'   # The 'a + combining retroflex hook below' from zwiaz
>>> len(unicodedata.normalize('NFC', s3))
    2

asked Mar 06 '14 13:03

Christian Aichinger

1 Answer

You have several options:

1. Some consoles support escape sequences for pixel-exact positioning of the cursor. This might cause some overprinting, though.

   Historical note: This approach was used in the Amiga terminal to display images in a console window by printing a line of text and then advancing the cursor down by one pixel. The leftover pixels of each text line slowly built up an image.

2. Create a table in your code that contains the real (pixel) widths of all Unicode characters in the font used by the console / terminal window. Use a UI framework and a small Python script to generate this table.

   Then add code that calculates the real width of the text using this table. The result might not be a multiple of the character width in the console, though. Together with pixel-exact cursor movement, this might solve your issue.

   Note: You'll have to add special handling for ligatures (fi, fl) and composites. Alternatively, you can load a UI framework without opening a window and use its graphics primitives to calculate the string widths.

3. Use the tab character (\t) to indent. That will only help if your shell actually uses the real text width to place the cursor; many terminals simply count characters.

4. Create an HTML file with a table and look at it in a browser.
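If the terminal uses a fixed-width font, the width-table idea above can be approximated in character cells rather than pixels. The sketch below is a stand-in for a font-measured table, not the method described in this answer: it guesses a per-character column count from the standard `unicodedata` module (0 for combining marks and format characters, 2 for East Asian wide or fullwidth characters, 1 otherwise) and pads manually instead of relying on `str.format()`. The names `char_width`, `display_width`, and `pad` are hypothetical helpers, and real terminals may disagree with these guesses, for example for emoji sequences or ambiguous-width characters.

```python
import unicodedata

def char_width(ch):
    """Estimated number of terminal columns for one code point (heuristic)."""
    if unicodedata.combining(ch):
        return 0  # combining marks are drawn over the previous character
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # other non-spacing marks and format characters such as U+200B
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters usually take two columns
    return 1

def display_width(text):
    """Estimated number of columns the whole string occupies."""
    return sum(char_width(ch) for ch in text)

def pad(text, width):
    """Left-align text by appending spaces up to the estimated width."""
    return text + " " * max(0, width - display_width(text))

# Sample rows modelled on the question; the exact values are illustrative.
rows = {
    "zoothe\u0300que": "zooth.",
    "\u6bcd": "\u6bcd\u6bcd",
    "zwia\u0322z-": "zw.",
}
for title, abbr in rows.items():
    print(pad(title, 20) + " | " + abbr)
```

Padding by hand keeps the width logic in one place; whether the columns actually line up still depends on the terminal and font.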

answered Oct 15 '22 16:10

Aaron Digulla