
Behavior of extended bytes/characters in C/POSIX locale

C and POSIX both require only a very limited set of characters to be present in the C/POSIX locale, but allow additional characters to exist. This leaves a great deal of freedom to the implementation; for instance, supporting all of Unicode (as UTF-8) in the C locale is conforming behavior. However, most historical implementations treat the C locale as having an "8-bit-clean" single-byte character encoding, either ISO-8859-1 (Latin-1) or a sort of "abstract 8-bit character set" where the non-ASCII bytes are abstract characters with no particular identity. (In the latter case, though, if the compiler defines __STDC_ISO_10646__, they normatively correspond to Unicode characters, usually the Latin-1 range.)

Another conforming option that seems much less popular is to treat all non-ASCII bytes as non-characters, i.e. respond to them with an EILSEQ error.

What I'm interested in knowing is whether there are implementations which take this or any other unusual options in implementing the C locale. Are there implementations where attempting to convert "high bytes" in the C locale results in EILSEQ or anything other than treating them as (abstract or Latin-1) single-byte characters or UTF-8?
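
As a concrete illustration, a minimal probe along these lines might look like the following sketch (it assumes the default "C" locale and uses only the standard mbrtowc/errno interfaces), reporting whether a high byte is accepted as a character, treated as the start of a multibyte sequence, or rejected with EILSEQ:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Probe how the default C/POSIX locale treats a single high byte.
 * Possible outcomes: a single-byte character (return value 1), the
 * start of a multibyte sequence ((size_t)-2), or an illegal sequence
 * ((size_t)-1 with EILSEQ in errno). */
int main(void)
{
        wchar_t wc;
        mbstate_t st;
        memset(&st, 0, sizeof st);
        errno = 0;
        size_t r = mbrtowc(&wc, "\x80", 1, &st);
        if (r == (size_t)-1)
                printf("illegal sequence (errno %s EILSEQ)\n",
                       errno == EILSEQ ? "==" : "!=");
        else if (r == (size_t)-2)
                printf("incomplete multibyte sequence\n");
        else
                printf("single character U+%04X\n", (unsigned)wc);
        return 0;
}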

asked Mar 26 '13 by R.. GitHub STOP HELPING ICE


2 Answers

From your comment to the previous answer:

The ways in which the assumption could be wrong are basically that bytes outside the portable character set could be illegal non-character bytes (EILSEQ) or make up some multibyte encoding (UTF-8 or a stateless legacy CJK encoding)

One example is Plan 9, which only supports the "C" locale. As you can see in utf.c and rune.c, when it finds a rune outside the portable character set, it simply handles it as a character from a different encoding.

Other candidates could be Minix and the *BSD family (insofar as they use Citrus). In the Minix source code I've also found the file command looking for a new encoding when the character size is not 8 bits.

answered Sep 29 '22 by Giacomo Tesio


Amusingly, I just found that the most widely-used implementation, glibc, is an example of what I'm looking for. Consider this simple program:

#include <stdlib.h>
#include <stdio.h>

/* No setlocale() call, so the default "C" locale is in effect. */
int main(void)
{
        wchar_t wc = 0;
        int n = mbtowc(&wc, "\x80", 1); /* try to convert the high byte 0x80 */
        printf("%d %.4x\n", n, (int)wc);
        return 0;
}

On glibc, it prints -1 0000. If the byte 0x80 were an extended character in the implementation's C/POSIX locale, it would print 1 followed by some nonzero character number.

Thus, the "common knowledge" that the C/POSIX locale is "8-bit-clean" on glibc is simply false. What's going on is that there's a gross inconsistency; despite the fact that all the standard utilities, regular expression matching, etc. are specified to operate on (multibyte) characters as if read by mbrtowc, the implementations of these utilities/functions are taking a shortcut when they see MB_CUR_MAX==1 or LC_CTYPE containing "C" (or similar) and reading char values directly instead of processing input with mbrtowc or similar. This is leading to an inconsistency between the specified behavior (which, as their implementation of the C/POSIX locale is defined, would have to treat high bytes as illegal sequences) and the implementation behavior (which is bypassing the locale system entirely).

With all that said, I am still looking for other implementations with the properties requested in the question.

answered Sep 29 '22 by R.. GitHub STOP HELPING ICE