
C String encoding Windows/Linux

Tags: c, c-strings

If I take the length of a string containing a character outside the 7-bit ASCII table, I get different results on Windows and Linux:

Windows: strlen("ö") = 1
Linux:   strlen("ö") = 2

On the Windows machine the string is evidently encoded in an "extended ASCII" code page as the single byte 0xF6, whereas on the Linux machine it is encoded in UTF-8 as the two bytes 0xC3 0xB6, which gives a length of 2. Note that strlen counts bytes, not characters.

Question:

Why does a C string get encoded differently on a Windows and a Linux machine?


The question came up in a discussion I had with a fellow forum member on Code Review (see this thread).

Frode Akselsen asked Dec 24 '16 01:12




1 Answer

Why does a C string get encoded differently on a Windows and a Linux machine?

First, this is not a Windows/Linux (operating system) issue but a compiler one: compilers exist on Windows that encode like gcc (common on Linux).

C allows this, and the two compiler makers have chosen different implementations to suit their own programming goals: MS uses CP-1252, while gcc on Linux uses UTF-8, an encoding of Unicode (per @Danh). MS's selection pre-dates Unicode. It is not surprising that different compiler makers employ different solutions.

5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
C11dr §5.2.1 1 (My emphasis)

strlen("ö") = 1
strlen("ö") = 2

The bytes stored for "ö" follow the compiler's execution character set, in which ö is an extended character, so its value is implementation-defined.

I suspect MS is focused on maintaining their existing code base and encourages other languages. Linux is simply an earlier adopter of Unicode in C, even though MS was an early Unicode influencer.

As Unicode support grows, I expect that to be the solution of the future.

chux - Reinstate Monica answered Sep 26 '22 14:09