C String encoding Windows/Linux

Tags: c, c-strings

If I take the length of a string containing a character outside the 7-bit ASCII table, I get different results on Windows and Linux:

Windows: strlen("ö") = 1
Linux:   strlen("ö") = 2

On a Windows machine the string is evidently encoded in an "extended ASCII" code page as the single byte 0xF6, whereas on a Linux machine it is encoded in UTF-8 as 0xC3 0xB6, which gives a length of 2 bytes.

Question:

Why does a C string get encoded differently on a Windows and a Linux machine?

The question came up in a discussion I had with a fellow forum member on Code Review (see this thread).


asked Dec 24 '16 01:12

Frode Akselsen

1 Answer

Why does a C string get encoded differently on a Windows and a Linux machine?

First, this is not a Windows/Linux (operating system) issue but a compiler one: compilers exist on Windows that encode like gcc (common on Linux).

This is allowed by C, and the two compiler makers have chosen different implementations per their own programming goals, MS using CP-1252 and gcc on Linux using Unicode in the form of UTF-8 (@Danh). MS's selection pre-dates Unicode. It is not surprising that various compiler makers employ different solutions.

5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.
C11dr §5.2.1 1 (My emphasis)
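
To see which bytes a given compiler actually stored for a string, one can dump them. Below is a minimal sketch; hex_bytes is a hypothetical helper name (not a standard library function), and in the usage described afterwards the two encodings are spelled with explicit hex escapes so the result does not depend on how the source file itself is saved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as space-separated
 * uppercase hex (for example "C3 B6") and returns strlen(s).
 * The text in out is cut short if out is too small to hold every byte. */
size_t hex_bytes(const char *s, char *out, size_t out_size)
{
    size_t len;
    size_t used = 0;

    assert(s != NULL && out != NULL);
    len = strlen(s);

    if (out_size > 0)
        out[0] = '\0';

    /* Each step needs at most 3 characters plus the terminating NUL. */
    for (size_t i = 0; i < len && used + 3 < out_size; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not sign-extended. */
        int n = snprintf(out + used, out_size - used, "%s%02X",
                         i ? " " : "", (unsigned)(unsigned char)s[i]);
        used += (size_t)n;
    }
    return len;
}
```

With a 32-byte buffer, hex_bytes("\xC3\xB6", buf, sizeof buf) returns 2 and leaves "C3 B6" in buf, while hex_bytes("\xF6", buf, sizeof buf) returns 1 and leaves "F6". Passing a plain "ö" literal instead shows whatever the compiler at hand chose.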

strlen("ö") = 1
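strlen("ö") = 2

"ö" is an extended character, so the compiler encodes it per its own implementation-defined character sets. strlen() then counts the resulting bytes: 1 under CP-1252 (0xF6) and 2 under UTF-8 (0xC3 0xB6).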

I suspect MS is focused on maintaining their code base and encourages other languages. Linux is simply an earlier adopter of Unicode in C, even though MS was an early Unicode influencer.

As Unicode support grows, I expect that to be the solution of the future.
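
If the goal is UTF-8 regardless of the compiler's default, C11 offers u8 string literals and \u universal character names. Here is a small sketch under that assumption (C11 or later; both function names are hypothetical), which also illustrates that strlen() counts bytes rather than characters:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns U+00F6 (o with diaeresis) encoded as UTF-8.
 * The u8 prefix requests UTF-8 and the universal character name picks the
 * code point, so neither the source file encoding nor the execution
 * character set matters. The cast keeps this valid in C23, where u8
 * literals have element type char8_t. */
const char *o_umlaut_utf8(void)
{
    return (const char *)u8"\u00F6";
}

/* Hypothetical helper: counts code points in a valid UTF-8 string by
 * skipping continuation bytes, which have the bit pattern 10xxxxxx. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Here strlen(o_umlaut_utf8()) is 2, while utf8_code_points(o_umlaut_utf8()) reports 1. For ordinary "ö" literals, GCC's -fexec-charset= option and MSVC's /utf-8 option are the usual knobs for choosing the execution character set.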


answered Sep 26 '22 14:09

chux - Reinstate Monica
