
Which components use the locale variables?

I have read that every process has a set of locale variables associated with it. For example, these are the locale variables associated with the bash process on my system:

$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

I want to know who actually uses these locale variables.

Do the C standard functions (for example: fwrite()) and the Linux system calls use them? Does the behavior of some C standard functions or some Linux system call differ depending on the value of some locale variable?

Or is it only certain programs that can use these locale variables? For example, I can write a program that will display messages to the user in a different language depending on the value of the LANG locale variable.

Steve asked May 30 '18 14:05



2 Answers

By default, C's standard library functions use the "C" locale. You can switch it to the user locale to enable locale-specific:

  • Character handling
  • Collating
  • Date/time formatting
  • Numeric editing
  • Monetary formatting
  • Messaging

POSIX setlocale documentation contains an incomplete list of locale-dependent functions affected by it:

catopen, exec, fprintf, fscanf, isalnum, isalpha, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, iswalnum, iswalpha, iswblank, iswcntrl, iswctype, iswdigit, iswgraph, iswlower, iswprint, iswpunct, iswspace, iswupper, iswxdigit, isxdigit, localeconv, mblen, mbstowcs, mbtowc, newlocale, nl_langinfo, perror, psiginfo, setlocale, strcoll, strerror, strfmon, strftime, strsignal, strtod, strxfrm, tolower, toupper, towlower, towupper, uselocale, wcscoll, wcstod, wcstombs, wcsxfrm, wctomb

E.g.:

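// Needs <stdio.h>, <stdlib.h> and <locale.h>. The ' flag is a POSIX extension.
printf("%'d\n", 1000000000); // Still in the "C" locale: no grouping.
printf("Setting LC_ALL to %s\n", getenv("LANG")); // Assumes LANG is set.
setlocale(LC_ALL, ""); // Set user-preferred locale.
printf("%'d\n", 1000000000); // Grouped according to LC_NUMERIC.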

Outputs:

1000000000
Setting LC_ALL to en_US.UTF-8
1,000,000,000
Maxim Egorushkin answered Sep 30 '22 11:09


"I have read that every process has a set of locale variables associated with it."

That's not really true, or at least it is highly over-simplified.

Many standard library functions (and non-standard library functions) modify their behaviour based on a set of locale configurations which are maintained in some hidden global object within the standard library implementation. (In some library implementations, the locale configuration is maintained per-thread rather than globally, using thread-local static variables.) That may seem to be associated with a process, since typically each process has a single instance of the standard library's runtime, but it's important to understand that -- despite appearances -- locale support is part of the library, not the OS kernel. (Of course, nothing in any standard defines where the kernel's boundaries are, or even what a kernel might be. You could run your program "bare metal" or you might have an OS which considers it useful to implement the standard library within system calls. I'm talking here about common cases.)

Basic locale configuration is defined by the C standard in section 7.11 (of the C11 standard), which defines two interfaces:

  • setlocale, which modifies the library's locale configuration, and

  • localeconv, which queries part of the locale configuration, allowing user code to conform to the locale's numeric formatting conventions (including monetary formatting).

The locale configuration is divided into a number of more-or-less independent components, called "categories". (The C++ standard library builds its locales out of finer-grained "facets", which it groups into similar categories; "facet" is also a commonly-used word.) There are five categories defined by the C standard and one more defined by Posix, but the categories are open-ended; individual standard library implementations are free to add additional categories. For example, the Gnu standard C library used on most Linux systems currently has a total of 12 categories. (See man 7 locale on your system for a current list.)

The standard categories are:

  • LC_CTYPE: Character classification and case conversion.
  • LC_COLLATE: Collation order.
  • LC_MONETARY: Monetary formatting.
  • LC_NUMERIC: Numeric, non-monetary formatting.
  • LC_TIME: Date and time formats.

and the Posix extension is:

  • LC_MESSAGES: Formats of informative and diagnostic messages and interactive responses.

Aside from localeconv, which only provides access to specific configurations from the LC_NUMERIC and LC_MONETARY categories, standard C provides no way to query any specific configuration. (Posix adds nl_langinfo, which can report some further items.)

Also, there is no standard way at all to set a single configuration. All you can do is use setlocale to configure an entire category, using a library-dependent and non-standardised locale name (which is just a character string). More precisely, two locale names are standardised:

  • The C standard defines the locale name C.

  • Posix defines the locale name POSIX. However, Posix specifies that the corresponding locale shall be identical to the locale named C.

The details of locale naming are (or should be) described in the locale documentation for the environment you're working in, but normally a locale-aware program will never call setlocale with a string constant other than the standard names, or the empty string. (I'll get to that in a minute.)

The setlocale interface allows the program to set an individual locale category, or to set all locale categories to the same locale name. It also returns a string which can be used to return to a previously configured locale category (or complete configuration).

The category names shown in the list of categories above are macros defined in <locale.h>. That header file also defines an additional macro, LC_ALL, which selects all categories at once. One of these macros must be used as the first argument to setlocale.

The C and Posix standards both require that the initial locale setting on program startup is the C locale. Many aspects of the C locale are standardised (and somewhat more aspects of the Posix locale are standardised). This standardisation allows a programmer to predict how numeric conversions will work, for example.

But it is often the case that a programmer will want to interact with the program's user with that user's own locale preferences. It is obviously not desirable that every single program have its own idiosyncratic mechanism for determining what the user's locale preferences are, so the standard library provides a mechanism for setting the locale (or individual locale categories) to whatever the default locale is configured to: calling setlocale with the empty string ("") as a locale name. The C standard does not specify any particular mechanism for configuring this information; it merely assumes that one exists.

(Side note: Calling setlocale with an empty string as locale name is not the same as calling setlocale with NULL as locale name. NULL tells setlocale to not change any locale setting, but it will still return the string associated with the current locale. This avoids the need for a getlocale interface.)

Posix does specify a mechanism for configuring user preferences, and it also insists that (most) standardised command-line utilities operate in the default locale. That mechanism uses environment variables whose names correspond to the setlocale category macros.

On a Posix implementation, when the program calls setlocale(LC_X, ""); the library will proceed to examine the current environment:

  1. First, it looks for the environment variable LC_ALL. If that is defined and has a non-empty value, it is used to define the locale.

  2. Otherwise, if the first argument to setlocale was not LC_ALL it looks for the environment variable whose name is the same as that argument. If that is defined and has a non-empty value, it is used to define the locale.

  3. Otherwise, if the environment variable LANG is defined and has a non-empty value, it is used (in some implementation dependent way) to construct a locale name. (LANG is supposed to indicate the user's language, which is an important part of their locale preferences.)

  4. Finally, some system-wide default is used.

Environment variables are generally initialised by the login program (or GUI equivalent) on the basis of system configuration files. (The precise mechanism varies from distribution to distribution and documentation is often difficult to find.)

As mentioned, almost all standard shell utilities are required by Posix to do the equivalent of setlocale(LC_ALL, ""); in order to operate in the user's configured locale. Every utility's manpage (or other documentation) should specify whether it does this or not, but it's reasonable to assume that it does unless there is some information to the contrary.

Also, many (but not all) standard library string functions are locale-aware. Library interfaces which are definitely not locale-aware include isdigit and isxdigit, which always respond on the basis of the C locale, and strcmp, which compares strings in the same way as memcmp, using the char value (interpreted as an unsigned char) to determine collation order. (strcoll is locale-aware, if you want to do comparison according to LC_COLLATE.) And the character encodings used for wide and multibyte characters are controlled (in some unspecified way) by the LC_CTYPE category.

rici answered Sep 30 '22 09:09