
Are spelling variations of encoding identifiers for "setlocale" standardized or documented?

This question concerns the syntactic conventions for encoding identifiers in locale names passed to setlocale in C, using UTF-8 as the example. My preliminary observation is that different commands on Ubuntu are not even consistent with each other in how they spell it:

homer@orca1:~/Documents$ locale -a
C
C.utf8
en_AG
en_AG.utf8
en_AU.utf8

versus

homer@orca1:~/Documents$ localectl list-locales
C.UTF-8
en_AG.UTF-8
en_AU.UTF-8
en_BW.UTF-8
en_CA.UTF-8

My empirical conclusion (from just trying many variations) is that the glibc implementation of setlocale (on Ubuntu) treats the encoding identifier case-insensitively and accepts it with or without a hyphen before the "8", but accepts no separator other than a hyphen in that position. So, the following all work:

setlocale(LC_ALL, "C.utf8");
setlocale(LC_ALL, "C.UTF-8");
setlocale(LC_ALL, "C.Utf-8");
setlocale(LC_ALL, "C.utF8");

But the following will not work:

setlocale(LC_ALL, "C.UTF_8");

However, I have no reason to assume that other C library implementations will have this exact same behavior with regard to accepted variations of spelling of "UTF-8".

So my question is:

  • Is there any kind of standardization for what spelling variations of encoding identifiers one can expect to be accepted by setlocale? For example, is "UTF-8" always guaranteed to work (assuming the underlying platform supports the UTF-8 encoding)?
  • For a particular C standard library implementation, is there some way other than trial and error to know what spelling variations will work?

(Note: searching the web has produced nothing more helpful than verbiage that vaguely states that the behavior is "implementation-defined".)

NikS asked Oct 27 '25 19:10



1 Answer

> Is there any kind of standardization for what spelling variations of encoding identifiers one can expect to be accepted by setlocale? For example, is "UTF-8" always guaranteed to work (assuming the underlying platform supports the UTF-8 encoding)?

The C language specification defines the "C" locale, and specifies that "" identifies an implementation-specific native locale. Any other locale names, and their significance, are implementation-defined. That is a technical term: it defers the decision to implementations (Glibc, Microsoft CRT, etc.) and makes it a conformance requirement for them to document their choice.

POSIX additionally defines the meaning of a "POSIX" locale, but still leaves any other locale names implementation-defined.

These documents do not define a significance for other-cased variants of those locale names. Conforming C standard library implementations are required to document the additional locale names they support, and that can include specifying them to be case insensitive. Even that does not inherently imply much in the way of standardization, however.

Glibc, for example, meets that requirement parametrically: it defers in turn to the specific system, which in practice means that locale names are looked up as file names on the host system. This approach is inspired by the X/Open Portability Guide (XPG), so it is somewhat standard in that sense.

> For a particular C standard library implementation, is there some way other than trial and error to know what spelling variations will work?

"Implementation defined" means that conforming implementations are obligated to document their choice, so you should consult the documentation of your implementation. As the Glibc example above shows, however, the documentation does not necessarily provide an explicit list or hard rules, though it will, generally, tell you some way in which you can find out the supported locale names. For Glibc, that would be examining the contents of the directories it searches for locale definitions.

There is no standard programmatic interface for enumerating available locale names, but on POSIX systems there is the locale command that you already know about. The localectl command is part of systemd, and so specific to a subset of Linux distributions.

John Bollinger answered Oct 30 '25 08:10
