
Formatting columns containing non-ascii characters

So I want to align fields containing non-ascii characters. The following does not seem to work:

for word1, word2 in [['hello', 'world'], ['こんにちは', '世界']]:
    print "{:<20} {:<20}".format(word1, word2)

hello                world
こんにちは      世界

Is there a solution?

usual me asked Jan 07 '16 12:01


2 Answers

You are formatting a multi-byte encoded byte string. You appear to be using UTF-8 to encode your text, and that encoding uses between 1 and 4 bytes per codepoint depending on the specific character. Formatting a Python 2 byte string counts bytes, not codepoints, which is one reason why your strings end up misaligned:

>>> len('hello')
5
>>> len('こんにちは')
15
>>> len(u'こんにちは')
5

Format your text as Unicode strings instead, so that you can count codepoints, not bytes:

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{:<20} {:<20}".format(word1, word2)
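
For readers on Python 3, where every str is already a sequence of codepoints, the same bytes-versus-codepoints distinction shows up between str and bytes. A minimal sketch, assuming Python 3 and UTF-8 encoding (the variable names are only for illustration):

```python
text = "こんにちは"
encoded = text.encode("utf-8")

print(len(text))     # 5: codepoints in the str
print(len(encoded))  # 15: bytes in the UTF-8 encoding

# format() pads by codepoint count, so the result is 20 codepoints plus "|".
line = "{:<20}|".format(text)
print(len(line))     # 21
```

The padding is still computed per codepoint, so the double-width problem is the same in Python 3.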

Your next problem is that these characters are also wider than most; you have double-wide codepoints:

>>> import unicodedata
>>> unicodedata.east_asian_width(u'h')
'Na'
>>> unicodedata.east_asian_width(u'世')
'W'
>>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
...     print u"{:<20} {:<20}".format(word1, word2)
...
hello                world
こんにちは                世界

str.format() is not equipped to deal with that issue; you'll have to manually adjust your column widths before formatting based on how many characters are registered as wider in the Unicode standard.

This is tricky because there is more than one width available. See the East Asian Width Unicode standard annex; there are narrow, wide and ambiguous widths; narrow is the width most other characters print at, wide is double that on my terminal. Ambiguous is... ambiguous as to how wide it'll actually be displayed:

Ambiguous characters require additional information not contained in the character code to further resolve their width.

How they are displayed depends on the context; Greek characters, for example, are displayed as narrow characters in a Western text, but wide in an East Asian context. My terminal displays them as narrow, but other terminals (configured for an East Asian locale, for example) may display them as wide instead. I'm not sure if there are any fool-proof ways of figuring out how that would work.
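
One way to cope with the ambiguous category is to make the decision an explicit parameter rather than guessing. The following is only a sketch, written for Python 3; cell_width and its ambiguous_wide flag are hypothetical names, and whether the flag matches what a given terminal really does is something you would have to check yourself:

```python
import unicodedata

def cell_width(char, ambiguous_wide=False):
    """Return the number of terminal cells one character is assumed to take."""
    category = unicodedata.east_asian_width(char)
    if category in ("W", "F"):
        # Wide and fullwidth characters take two cells.
        return 2
    if category == "A" and ambiguous_wide:
        # Ambiguous characters take two cells only if the caller says so.
        return 2
    return 1

for char in ("h", "世", "α"):
    print(char, unicodedata.east_asian_width(char),
          cell_width(char), cell_width(char, ambiguous_wide=True))
```

The Greek letter α is reported as 'A', so its assumed width changes with the flag, while 'h' and '世' are unaffected.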

For the most part, you need to count characters with a 'W' or 'F' value for unicodedata.east_asian_width() as taking 2 positions; subtract 1 from your format width for each of these:

def calc_width(target, text):
    return target - sum(unicodedata.east_asian_width(c) in 'WF' for c in text)

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20,  word2))

This then produces the desired alignment in my terminal:

>>> for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
...     print u"{0:<{1}} {2:<{3}}".format(word1, calc_width(20, word1), word2, calc_width(20,  word2))
...
hello                world
こんにちは           世界

The slight misalignment you may see above is your browser or font using a different width ratio (not quite double) for the wide codepoints.

All this comes with a caveat: some terminals do not support the East Asian Width Unicode property and display all codepoints at one width only.
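
The same adjustment can be wrapped in a small padding helper instead of being passed into str.format(). This is a sketch for Python 3 under the assumptions above ('W' and 'F' take two cells, everything else one); display_width and ljust_display are hypothetical names, and the helper ignores zero-width combining characters and ambiguous-width characters:

```python
import unicodedata

def display_width(text):
    """Count terminal cells, treating 'W' and 'F' characters as two cells."""
    return sum(2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
               for c in text)

def ljust_display(text, width, fill=" "):
    """Left-justify text so that it occupies `width` terminal cells."""
    return text + fill * max(0, width - display_width(text))

rows = [("hello", "world"), ("こんにちは", "世界")]
for word1, word2 in rows:
    print(ljust_display(word1, 20), ljust_display(word2, 20))
```

Padding by hand keeps the width logic in one place, at the cost of not being usable directly inside a format specification.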

Martijn Pieters answered Oct 30 '22 04:10


This is no easy task. The problem is not simply "non-ascii" text: these are wide Unicode characters, and displaying them is quite tricky. The result fundamentally depends more on the terminal type you are using than on the number of spaces you put in there.

To start with, you have to use UNICODE strings. Since you are in Python 2, this means you should prefix your text-quotes with "u".

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    print u"{:<20} {:<20}".format(word1, word2)

That way, Python can actually recognize each character inside the strings as a character, not as a collection of bytes that just are displayed back due to chance.

>>> a = u'こんにちは'
>>> len(a)
5
>>> b = 'こんにちは'
>>> len(b)
15

At first glance it looks like these lengths could be used to calculate the character width. Unfortunately, the byte length of the UTF-8 encoded characters is not related to the actual displayed width of the characters. Single-width Unicode characters can also be multi-byte in UTF-8 (like ç).
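
To make that concrete, here is a small Python 3 sketch comparing the UTF-8 byte length of a few characters with an assumed display width. The helper name assumed_width is hypothetical, and it only applies the rule that 'W' and 'F' characters take two cells:

```python
import unicodedata

def assumed_width(char):
    """Assume two cells for wide/fullwidth characters, one otherwise."""
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1

samples = ["h", "\u00e7", "世"]  # "\u00e7" is ç
table = [(ch, len(ch.encode("utf-8")), assumed_width(ch)) for ch in samples]

for ch, byte_len, width in table:
    print(ch, byte_len, width)
# h takes 1 byte and 1 cell, ç takes 2 bytes and 1 cell,
# 世 takes 3 bytes and 2 cells.
```

No fixed ratio turns the byte count into the cell count, which is why the width has to come from the character's Unicode properties instead.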

Now, once we are talking about Unicode, Python does include some utilities, including a function that reports the display width category of each Unicode character: unicodedata.east_asian_width. This gives you a way to compute the width of each string and then to work out proper spacing numbers:

The auto-calculation of the " {:

import unicodedata

def display_len(text):
    # Count wide ('W') and fullwidth ('F') characters as two columns each.
    res = 0
    for char in text:
        res += 2 if unicodedata.east_asian_width(char) in ('W', 'F') else 1
    return res

for word1, word2 in [[u'hello', u'world'], [u'こんにちは', u'世界']]:
    # Pad by hand so the second word starts at display column 20.
    width_format = u"{{}}{}{{}}".format(" " * (20 - (display_len(word1))))
    print width_format.format(word1, word2)

That has worked for me on my terminal:

hello              world
こんにちは          世界

But as Martijn puts it, it is more complicated than that. There are ambiguous characters and different terminal types. If you really need this text to be aligned in a text terminal, then you should use a terminal library, like curses, which allows you to specify a display coordinate to print a string at. That way, you can simply position your cursor explicitly on the appropriate column before printing each word, and avoid all display-width computation.
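
A rough sketch of that curses approach, written for Python 3 on a Unix-like system where the standard curses module is available. The names placements, draw, main and SECOND_COLUMN are hypothetical, and the column value 21 is just an assumption for this example:

```python
import curses
import locale

ROWS = [("hello", "world"), ("こんにちは", "世界")]
SECOND_COLUMN = 21  # assumed screen column where the second word starts

def placements(rows, second_column=SECOND_COLUMN):
    """Return (y, x, text) triples: one explicit screen position per word."""
    cells = []
    for y, (word1, word2) in enumerate(rows):
        cells.append((y, 0, word1))
        cells.append((y, second_column, word2))
    return cells

def draw(stdscr):
    """Write every word at its own coordinate; no padding is computed."""
    stdscr.clear()
    for y, x, text in placements(ROWS):
        stdscr.addstr(y, x, text)
    stdscr.refresh()
    stdscr.getkey()  # wait for a key press before restoring the terminal

def main():
    # curses relies on the locale to encode non-ASCII text.
    locale.setlocale(locale.LC_ALL, "")
    curses.wrapper(draw)

# Call main() from a real terminal to see the result.
```

The sketch deliberately does not call main() on its own, because curses needs an interactive terminal to run.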

jsbueno answered Oct 30 '22 04:10