 

printf field width : bytes or chars?

Tags: c, unicode, glibc

The printf/fprintf/sprintf family supports a width field in its format specifier. I have a question about the case of (non-wide) char array arguments:

Is the width field supposed to mean bytes or characters?

What is the correct (or de facto) behaviour if the char array holds, say, a raw UTF-8 string? (I know that I should normally use some wide char type; that's not the point.)

For example, in

char s[] = "ni\xc3\xb1o";  // utf8 encoded "niño"
fprintf(f,"%5s",s);

Is that function supposed to output at least 5 bytes (plain C chars), leaving you responsible for misalignment or other problems when two bytes make up one textual character?

Or is it supposed to compute the length of the array in "textual characters" (decoding it, presumably according to the current locale)? In the example, that would mean finding out that the string has 4 Unicode characters, so it would add one space of padding.

UPDATE: I agree with the answers; it is logical that the printf family doesn't distinguish plain C chars from bytes. The problem is that my glibc does not seem to fully respect this notion if the locale has been set beforehand and one has the (nowadays most common) LANG/LC_CTYPE=en_US.UTF-8.

Case in point:

#include <stdio.h>
#include <locale.h>

int main(void) {
        setlocale(LC_ALL, ""); /* I have LC_CTYPE="en_US.UTF-8" */
        char s[] = {'n','i','\xc3','\xb1','o',0}; /* "niño" in utf8: 5 bytes, 4 unicode chars */
        printf("|%*s|\n",6,s); /* this should pad a blank - works ok */
        printf("|%.*s|\n",4,s); /* this should eat a char - works ok */
        char s3[] = {'A','\xb1','B',0}; /* this is not valid UTF8 */
        printf("|%s|\n",s3);     /* print raw chars - ok */
        printf("|%.*s|\n",15,s3);     /* output stops after the first '|' (why???) */
        return 0;
}

So, even when a non-POSIX-C locale has been set, printf still seems to have the right notion for counting width: bytes (plain C chars), not Unicode characters. That's fine. However, when given a char array that is not decodable in the current locale, it fails silently (nothing is printed after the first '|', and there is no error message), but only if it needs to count some width. I don't understand why it even tries to decode the string as UTF-8 when it doesn't need to. Is this a bug in glibc?

Tested with glibc 2.11.1 (Fedora 12) (also glibc 2.3.6)

Note: it's not related to terminal display issues. You can check the output by piping it to od:

$ ./a.out | od -t cx1

Here's my output:

0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a
0000020   |   A 261   B   |  \n   |
         7c  41  b1  42  7c  0a  7c

UPDATE 2 (May 2015): This questionable behaviour has been fixed in newer versions of glibc (from 2.17, it seems). With glibc-2.17-21.fc19 it works ok for me.

leonbloy asked Jan 23 '23 02:01


1 Answer

It will result in five bytes being output. And five chars. In ISO C, there is no distinction between chars and bytes. Bytes are not necessarily 8 bits, instead being defined as the width of a char.

The ISO term for an 8-bit value is an octet.

Your "niño" string is actually five characters wide in terms of the C environment (sans the null terminator, of course). If only four symbols show up on your terminal, that's almost certainly a function of the terminal, not C's output functions.
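A minimal sketch of the point above, using snprintf so the result can be inspected without a terminal in the way. The helper name padded_len is made up for this illustration; the only assumption is that the implementation counts the %s field width in bytes, as described above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats s right-aligned in a field of `width`
   and returns the number of bytes snprintf reports for the result. */
static int padded_len(char *out, size_t outsz, int width, const char *s)
{
    return snprintf(out, outsz, "%*s", width, s);
}
```

Called with "ni\xc3\xb1o" (strlen 5), a width of 5 should give 5 bytes with no padding at all, and a width of 6 should give 6 bytes with a single leading space, even though only four glyphs show up on a UTF-8 terminal.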

I'm not saying a C implementation couldn't handle Unicode. It could quite easily do UTF-32 if CHAR_BIT was defined as 32. UTF-8 would be harder since it's a variable-length encoding, but there are ways around almost any problem :-)
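One such way around, as a rough sketch: count the code points yourself and widen the field by the number of continuation bytes. Both function names below are invented for this example, and the code assumes the input is valid UTF-8. It also treats one code point as one column, which is not true for combining marks or double-width characters.

```c
#include <stdio.h>
#include <string.h>

/* Counts code points in a UTF-8 string by skipping continuation bytes
   (those of the form 10xxxxxx). Assumes the input is valid UTF-8. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}

/* Right-aligns s in a field of `width` code points. printf counts bytes,
   so the field is widened by the number of continuation bytes in s. */
static int format_utf8_padded(char *out, size_t outsz, int width, const char *s)
{
    size_t extra = strlen(s) - utf8_codepoints(s);
    return snprintf(out, outsz, "%*s", width + (int)extra, s);
}
```

With "ni\xc3\xb1o" and a width of 5, this should produce one space followed by the five bytes of the string, which lines up as five columns on a UTF-8 terminal.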


Based on your update, it seems like you might have a problem. However, I'm not seeing your described behaviour in my setup with the same locale settings. In my case, I'm getting the same output in those last two printf statements.

If your setup is just stopping output after the first | (I assume that's what you mean by abort; if you meant the whole program aborts, that's much more serious), I would raise the issue with GNU (try your particular distribution's bug procedures first). You've done all the important work, such as producing a minimal test case, so someone should be happy to run that against the latest version if your distribution isn't on it yet (most aren't).


As an aside, I'm not sure what you meant by checking the od output. On my system, I get:

pax> ./qq | od -t cx1
0000000   |       n   i 303 261   o   |  \n   |   n   i 303 261   |  \n
         7c  20  6e  69  c3  b1  6f  7c  0a  7c  6e  69  c3  b1  7c  0a
0000020   |   A 261   B   |  \n   |   A 261   B   |  \n
         7c  41  b1  42  7c  0a  7c  41  b1  42  7c  0a
0000034

so you can see the output stream contains the UTF-8, meaning that it's the terminal program which must interpret this. C/glibc isn't modifying the output at all, so maybe I just misunderstood what you were trying to say.

Although I've just realised you may be saying that your od output has only the starting bar on that line as well (unlike mine, which doesn't appear to have the problem). That would mean something is wrong within C/glibc, rather than the terminal silently dropping characters. In all honesty, I would expect a terminal to drop either the whole line or just the offending character (i.e., output |A), so the fact that you're getting just | seems to preclude a terminal problem. Please clarify that.
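One way to tell the two cases apart from inside the program is to look at what printf itself reports. This is only a sketch: checked_print is a made-up helper, and it assumes that a library which refuses the string returns a negative value and sets errno, which is what POSIX describes for output errors.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints s between bars, truncated to at most `prec`
   bytes, and reports on stderr if printf itself signals a failure.
   Returns printf's result (bytes written, or negative on error). */
static int checked_print(int prec, const char *s)
{
    errno = 0;
    int n = printf("|%.*s|\n", prec, s);
    if (n < 0)
        fprintf(stderr, "printf failed: %s\n", strerror(errno));
    return n;
}
```

For the s3 array from the question with a precision of 15, I would expect a return value of 6 (two bars, three bytes and the newline) on a setup like mine. A negative return would point at the library rather than the terminal.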

paxdiablo answered Feb 01 '23 15:02