
Internal and external encoding vs. Unicode

Tags: c++, c, posix, windows

Since a lot of misinformation was spread by several posters in the comments on this question (C++ ABI issues list), I have created this one to clarify.

  1. What are the encodings used for C style strings?
  2. Is Linux using UTF-8 to encode strings?
  3. How does external encoding relate to the encoding used by narrow and wide strings?
Šimon Tóth asked Dec 06 '25 06:12


1 Answer

  1. Implementation defined. Or even application defined; the standard doesn't really put any restrictions on what an application does with them, and expects a lot of the behavior to depend on the locale. All that is really implementation defined is the encoding used in string literals.

  2. In what sense? Most of the OS ignores most of the encodings; you'll have problems if '\0' isn't a nul byte, but even EBCDIC meets that requirement. Otherwise, depending on the context, there will be a few additional characters which may be significant (a '/' in path names, for example); all of these are among the first 128 code points in Unicode, so they have a single-byte encoding in UTF-8. As an example, I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only real issue is displaying them: if you run ls in an xterm, for example, ls and the xterm will assume that the filenames are in the same encoding as the display font.

  3. That mainly depends on the locale. Depending on the locale, it's quite possible for the internal encoding of a narrow character string not to correspond to that used for string literals. (But how could it be otherwise? The encoding of a string literal must be determined at compile time, whereas the internal encoding of a narrow character string depends on the locale used to read it, and can vary from one string to the next.)
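
To illustrate point 2, here is a minimal sketch of code that, like most of the OS, treats a path as plain bytes and only gives meaning to '/'. The function name `split_path` is made up for this example. It works unchanged on UTF-8 and ISO 8859-1 filenames: every byte of a multi-byte UTF-8 sequence has its high bit set, so the byte 0x2F can only ever be a real '/', and ISO 8859-1 is single-byte to begin with.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a path into components at each '/' byte,
// without interpreting any other byte. Empty components (from a leading
// '/' or from "//") are dropped.
std::vector<std::string> split_path(const std::string& path)
{
    std::vector<std::string> parts;
    std::string current;
    for (char c : path) {
        if (c == '/') {
            if (!current.empty())
                parts.push_back(current);
            current.clear();
        } else {
            current += c;  // copied verbatim, whatever encoding it is in
        }
    }
    if (!current.empty())
        parts.push_back(current);
    return parts;
}
```

Calling it with `"/home/caf\xC3\xA9/notes.txt"` (UTF-8) or `"/home/caf\xE9/notes.txt"` (ISO 8859-1) gives three components either way; only the byte length of the middle one differs. Which of the two a terminal displays correctly is a separate matter, as noted above.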

If you're developing a new application on Linux, I would strongly recommend using Unicode for everything, with UTF-32 for wide character strings, and UTF-8 for narrow character strings. But don't count on anything outside the first 128 code points working in string literals.
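
As a rough sketch of that setup, the function below decodes UTF-8 held in a narrow string into UTF-32. The name `utf8_to_utf32` is invented for this example, and it uses `std::u32string` (C++11) rather than `std::wstring` so that the element width does not depend on the platform's `wchar_t`. It is deliberately minimal: it rejects truncated or malformed sequences, but does not check for overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on a bad lead byte, a bad continuation
// byte, or a sequence cut off by the end of the input.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

In line with the warning about string literals, non-ASCII input is best written with byte escapes: `utf8_to_utf32("na\xC3\xAFve \xE2\x82\xAC")` spells the UTF-8 bytes out explicitly instead of relying on how the compiler encodes the source file.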

James Kanze answered Dec 08 '25 18:12