
What strategies exist for ensuring all locale-aware operations are handled correctly in all locales?

Somewhat out of necessity, I develop software with my locale set to either "C" or "en_US". It's difficult to use a different locale because I only speak one language with anything even remotely approaching fluency.

As a result, I often overlook the differences in behavior that can be introduced by having different locale settings. Unsurprisingly, overlooking those differences sometimes leads to bugs which are only discovered by some unfortunate user running a different locale. In particularly bad cases, that user may not even share a language with me, which makes the bug reporting process a challenging one. And, importantly, a lot of my software is in the form of libraries; while almost none of it sets the locale, it may be combined with another library, or used in an application, which does set the locale - generating behavior I never experience myself.

To be a bit more specific, the kinds of bugs I have in mind are not missing text localizations or bugs in the code for using those localizations. Instead, I mean bugs where a locale changes the result of some locale-aware API (for example, toupper(3)) when the code using that API did not anticipate the possibility of such a change (e.g., in the Turkish locale, toupper does not change "i" to "I" - potentially a problem for a network server trying to speak a particular network protocol to another host).

A few examples of such bugs in software I maintain:

  • AttributeError in a Turkish locale
  • imap relies on a C locale for date formatting
  • Fix for locale-dependant date formatting in imap and conch

In the past, one approach I've taken to dealing with this is to write regression tests which explicitly change the locale to one where code was known not to work, exercise the code, verify correct behavior, and then restore the original locale. This works well enough, but only after someone has reported a bug, and it only covers one small area of a codebase.
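The switch-exercise-restore pattern described above might look roughly like this in C. This is only a sketch: run_in_locale and upper_i_is_ascii are hypothetical names made up for illustration, and the locale name passed in (for instance "tr_TR.UTF-8") is assumed to be installed on the test machine; if it is not, the helper reports that rather than failing.

```c
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical check: does toupper() still map 'i' to ASCII 'I'? */
int upper_i_is_ascii(void)
{
    return toupper('i') == 'I';
}

/* Hypothetical helper: run check() with LC_ALL set to `name`, then restore
 * the previous locale. Returns -1 if the locale could not be selected (or
 * on allocation failure), otherwise whatever check() returned. */
int run_in_locale(const char *name, int (*check)(void))
{
    /* The string returned by setlocale() may be overwritten by later
     * calls, so copy it before changing anything. */
    const char *current = setlocale(LC_ALL, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int result = -1;
    if (setlocale(LC_ALL, name) != NULL)
        result = check();

    setlocale(LC_ALL, saved);   /* restore the original locale */
    free(saved);
    return result;
}
```

A test would then assert that run_in_locale("C", upper_i_is_ascii) is 1, and treat a result of -1 for some other locale as "skipped" rather than "passed". Note that setlocale changes process-wide state, so a test like this should not run concurrently with other threads that depend on the locale.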

Another approach which seems possible is to have a continuous integration system (CIS) set up to run a full suite of tests in an environment with a different locale set. This improves the situation somewhat, by giving as much coverage in that one alternate locale as the test suite normally gives. Its shortcoming is that there are many, many, many locales, and each may possibly cause different problems. In practice, there are probably only a dozen or so different ways a locale can break a program, but having dozens of extra testing configurations is taxing on resources (particularly for a project already stretching its resource limits by testing on different platforms, against different library versions, etc).

Another approach which occurred to me is to use (possibly first creating) a new locale which is radically different from the "C" locale in every way it can be - have a different case mapping, use a different thousands separator, format dates differently, etc. This locale could be used with one extra CIS configuration and hopefully relied upon to catch any errors in the code that could be triggered by any locale.

Does such a testing locale exist already? Are there flaws in this approach to testing for locale compatibility?

What other approaches to locale testing have people taken?

I'm primarily interested in POSIX locales, since those are the ones I know about. However, I know that Windows has some similar features, so extra information about those (ideally with some background on how they work) could also be useful.

Jean-Paul Calderone asked Feb 28 '12 15:02



1 Answer

I would just audit your code for incorrect uses of functions like toupper. Under the C locale model, such functions should be considered as operating only on natural-language text in the locale's language. For any application which deals with potentially multi-lingual text, this means functions such as tolower should not be used at all.

If your target is POSIX, you have a little bit more flexibility due to the uselocale function which makes it possible to temporarily override the locale in a single thread (i.e. without messing up the global state of your program). You could then keep the C locale globally and use tolower etc. for ASCII/machine-oriented text (like config files and such) and only uselocale to the user's selected locale when working with natural-language text from said locale.
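As a rough sketch of that idea, assuming a POSIX.1-2008 system: the helper below (toupper_in_locale is a made-up name, not a standard function) loads a locale's LC_CTYPE rules with newlocale, installs them for the calling thread only with uselocale, and then puts the previous thread locale back. The global locale is never touched.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for newlocale/uselocale/freelocale */
#endif
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: upper-case one character using the LC_CTYPE rules of
 * the named locale, affecting only the calling thread while it runs.
 * `c` must be representable as unsigned char (or be EOF), as for toupper().
 * If the locale cannot be loaded, the thread's current locale is used. */
int toupper_in_locale(const char *name, int c)
{
    locale_t loc = newlocale(LC_CTYPE_MASK, name, (locale_t)0);
    if (loc == (locale_t)0)
        return toupper(c);

    locale_t previous = uselocale(loc);  /* switch this thread only */
    int result = toupper(c);
    uselocale(previous);                 /* switch back before freeing */
    freelocale(loc);
    return result;
}
```

In a real program you would more likely create the user's locale object once at startup and wrap whole natural-language code paths in the uselocale pair, rather than creating and freeing a locale per character as this sketch does.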

Otherwise (and perhaps even then, if your needs are more advanced), I think the best solution is to completely throw out functions like tolower and write your own ASCII versions for config text and the like, and use a powerful Unicode-aware library for natural-language text.
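For the machine-oriented half of that, ASCII-only replacements are short enough to write by hand. A minimal sketch follows; the names ascii_tolower, ascii_toupper and ascii_strcasecmp are illustrative, and the code assumes an ASCII-compatible execution character set, where 'A'..'Z' and 'a'..'z' are contiguous.

```c
/* ASCII-only case mapping. These never consult the locale, so their
 * results cannot change when some other code calls setlocale(). */
int ascii_tolower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

int ascii_toupper(int c)
{
    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* Case-insensitive comparison for protocol keywords, header names,
 * config keys and similar ASCII tokens. Returns 0 when equal. */
int ascii_strcasecmp(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = ascii_tolower((unsigned char)*a);
        int cb = ascii_tolower((unsigned char)*b);
        if (ca != cb || ca == 0)
            return ca - cb;
    }
}
```

With these, ascii_strcasecmp("Content-Type", "content-type") compares equal regardless of the process locale, and bytes outside A-Z/a-z pass through unchanged.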

One sticky issue that I haven't yet touched on is the decimal separator in relation to functions like snprintf and strtod. Having it changed to "," instead of "." in some locales can ruin your ability to parse or write machine-readable files with the C library. My preferred solution is simply to never set the LC_NUMERIC category at all, so that it stays at its "C" default. (And I'm a mathematician, so I tend to believe numbers should be universal, not subject to cultural convention.) Depending on your application, the only locale categories really needed may just be LC_CTYPE, LC_COLLATE, and LC_MESSAGES. Also often useful are LC_MONETARY and LC_TIME.
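Concretely, that means calling setlocale once per category instead of using LC_ALL. A sketch, with made-up function names (init_locale_without_numeric, parse_config_number, format_config_number):

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical startup routine: adopt the user's environment for selected
 * categories only. LC_NUMERIC is deliberately never set, so it keeps its
 * "C" default and the decimal separator stays ".". */
void init_locale_without_numeric(void)
{
    setlocale(LC_CTYPE, "");
    setlocale(LC_COLLATE, "");
#ifdef LC_MESSAGES               /* POSIX category, not part of ISO C */
    setlocale(LC_MESSAGES, "");
#endif
    setlocale(LC_TIME, "");
    setlocale(LC_MONETARY, "");
}

/* With LC_NUMERIC left alone, strtod() parses "1.5" as 1.5 whatever
 * the user's environment says. */
double parse_config_number(const char *text)
{
    return strtod(text, NULL);
}

/* Likewise, snprintf() always writes "." as the decimal separator. */
int format_config_number(char *buf, size_t size, double value)
{
    return snprintf(buf, size, "%g", value);
}
```

This only protects you as long as nothing else in the process (another library, for instance) calls setlocale(LC_ALL, "") or sets LC_NUMERIC behind your back.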

R.. GitHub STOP HELPING ICE answered Oct 05 '22 23:10
