
UTF-8 Width Display Issue of Chinese Characters

When I print data with printf in Perl or C, I use a field width in the format to control the width of each column, like this:

printf("%-30s", str);

But when str contains Chinese characters, the columns no longer align as expected (see the attached picture).

My Ubuntu locale is zh_CN.utf8. As far as I know, UTF-8 encodes each character in 1 to 4 bytes, and a Chinese character usually takes 3 bytes. In my tests, printf's width control counts a Chinese character as 3, but it actually displays as wide as 2 ASCII characters.

So the real display width is not the constant I expected but varies with the number of Chinese characters, i.e.

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

Here w is the expected field width, x is the number of Chinese characters, and Sw(x) is the real display width.

So the more Chinese characters str contains, the narrower the column displays.

How can I get the alignment I want? Should I count the Chinese characters before calling printf?

As far as I know, all Chinese characters (and, I guess, all wide characters) display 2 columns wide, so why does printf count each one as 3? UTF-8's encoding has nothing to do with display width.
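The arithmetic above can be checked with a small C sketch. It assumes, as the question does, that every 3-byte UTF-8 character is a 2-column Chinese character and that everything else is 1-column ASCII. The helper names `count_three_byte_chars` and `shown_columns` are made up for illustration, and the fixed 256-byte buffer is an arbitrary limit.

```c
#include <stdio.h>
#include <string.h>

/* Count the lead bytes of 3-byte UTF-8 sequences (0xE0..0xEF). */
static size_t count_three_byte_chars(const char *s)
{
    size_t n = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (c >= 0xE0 && c <= 0xEF)
            n++;
    }
    return n;
}

/* Format s left-justified in a field of w, exactly as printf would,
 * then estimate how many terminal columns the result occupies.
 * printf pads by bytes, so the result is w bytes long (or longer if s
 * is longer). Under the stated assumption each 3-byte character shows
 * as 2 columns, so columns = bytes - x, i.e. Sw(x) = w - x.
 * Assumes the formatted result fits in 256 bytes. */
static size_t shown_columns(const char *s, int w)
{
    char buf[256];
    snprintf(buf, sizeof buf, "%-*s", w, s);
    return strlen(buf) - count_three_byte_chars(buf);
}
```

For a two-character Chinese string in a field of 10, the formatted result is 10 bytes long but occupies only 8 columns, which matches w - x.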

gpanda asked May 25 '12 09:05



1 Answer

Yes, this is a problem with all versions of printf that I am aware of. I briefly discuss the matter in this answer and also in this one.

For C, I do not know of a library that will do this for you, but if anyone has it, it would be ICU.

For Perl, you have to use the Unicode::GCString module from CPAN to calculate the number of print columns a Unicode string will take up. This takes into account Unicode Standard Annex #11: East Asian Width.

For example, some code points take up 1 column and others take up 2 columns. There are even some that take up no columns at all, like combining characters and invisible control characters. The class has a columns method that returns how many columns the string takes up.
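For C, the same idea can be sketched by hand: decode the UTF-8 string into code points and add up a per-code-point column count. The block below is only a rough illustration. The range table is heavily abbreviated and is not a full implementation of UAX #11, and the function names `utf8_decode`, `cp_columns`, and `utf8_columns` are hypothetical. On POSIX systems, `wcwidth()` and `wcswidth()` from `<wchar.h>` offer a locale-dependent alternative.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into *cp; return bytes consumed.
 * Malformed or truncated input is consumed one byte at a time
 * and reported as U+FFFD. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len, i;
    if (s[0] < 0x80)                { *cp = s[0];        return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else                            { *cp = 0xFFFD;      return 1; }
    for (i = 1; i < len; i++) {
        /* A NUL terminator also fails this check, so we never overrun. */
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Approximate column count for one code point: 0, 1, or 2.
 * ASSUMPTION: this abbreviated table covers only common cases. */
static int cp_columns(uint32_t cp)
{
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
        return 0;                               /* control characters */
    if ((cp >= 0x0300 && cp <= 0x036F) ||       /* combining diacritics */
        (cp >= 0x200B && cp <= 0x200F))         /* zero-width characters */
        return 0;
    if ((cp >= 0x1100 && cp <= 0x115F) ||       /* Hangul Jamo leads */
        (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F) || /* CJK, kana */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||       /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||       /* CJK compatibility */
        (cp >= 0xFE30 && cp <= 0xFE6F) ||       /* CJK compat. forms */
        (cp >= 0xFF00 && cp <= 0xFF60) ||       /* fullwidth forms */
        (cp >= 0xFFE0 && cp <= 0xFFE6) ||
        (cp >= 0x20000 && cp <= 0x3FFFD))       /* supplementary CJK */
        return 2;
    return 1;
}

/* Sum the column counts of every code point in a UTF-8 string. */
size_t utf8_columns(const char *str)
{
    const unsigned char *s = (const unsigned char *)str;
    size_t cols = 0;
    while (*s) {
        uint32_t cp;
        s += utf8_decode(s, &cp);
        cols += (size_t)cp_columns(cp);
    }
    return cols;
}
```

Unlike Unicode::GCString, this sketch works per code point rather than per grapheme cluster, so it should be treated as an approximation for simple text only.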

I have an example of using this for aligning Unicode text vertically here. It will sort a bunch of Unicode strings, including some with combining characters and “wide” Asian ideograms (CJK characters), and allow you to align things vertically.

[image: sample terminal output]

The code for the little umenu demo program, which prints that nicely aligned output, is included below.

You might also be interested in the far more ambitious Unicode::LineBreak module, of which the aforementioned Unicode::GCString class is just a smaller component. This module is much cooler, and takes into account Unicode Standard Annex #14: Unicode Line Breaking Algorithm.

Here’s the code for the little umenu demo, tested on Perl v5.14:

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = Unicode::Collate::Locale->new(locale => "ja");

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

As you see, the key to making it work in that particular program is this line of code, which just calls other functions defined above, and uses the module I was discussing:

print pad(entitle($item), $width, ".");

That will pad out the item to the given width using dots as the fill character.
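The same padding step could look roughly like this in C. This is a minimal sketch, not a drop-in replacement for the Perl `pad`: the function name `pad_columns` is hypothetical, and it assumes the caller has already measured the string's display width in columns by some other means and passes it in as `cols`.

```c
#include <string.h>

/* Copy str into out, followed by enough copies of padchar to bring it
 * up to `width` display columns. `cols` is the display width of str,
 * measured by the caller (ASSUMPTION: not computed here).
 * Returns 0 on success, -1 if out is too small. */
int pad_columns(char *out, size_t outsz, const char *str,
                size_t cols, size_t width, char padchar)
{
    size_t len  = strlen(str);                       /* bytes, not columns */
    size_t fill = cols < width ? width - cols : 0;   /* never negative */

    if (len + fill + 1 > outsz)
        return -1;
    memcpy(out, str, len);
    memset(out + len, padchar, fill);
    out[len + fill] = '\0';
    return 0;
}
```

The point of the design is that the fill count comes from the column width while the copy uses the byte length, which is exactly the distinction printf's field width does not make.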

Yes, it’s a lot less convenient than printf, but at least it is possible.

tchrist answered Oct 19 '22 06:10
