
The C stdio character encoding

For my pet project I am experimenting with string representations, but I arrived at some troubling results. First, here is a short application:

#include <stdio.h>
#include <stddef.h>
#include <string.h>
void write_to_file(FILE* fp, const char* c, size_t len)
{
    /* write the length first, then the raw bytes of the string */
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(c, sizeof(char), len, fp);
}
int main()
{
    FILE* fp = fopen("test.cod", "wb+");
    const char* ABCDE = "ABCDE";
    write_to_file(fp, ABCDE, strlen(ABCDE) );
    const char* nor = "BBøæåBB";
    write_to_file(fp, nor, strlen(nor));
    const char* hun = "AAőűéáöüúBB";
    write_to_file(fp, hun, strlen(hun));
    const char* per = "CCبﺙگCC";
    write_to_file(fp, per, strlen(per));
    fclose(fp);
}

It does nothing special: it takes a string and writes its length, followed by the string itself, to a file. Now, the file, when viewed as hex, looks like:

[hex dump of the char* output]

I am happy with the first result: 5 (the first 8 bytes; I'm on a 64-bit machine), as expected. However, I expected the nor variable to have 7 characters (since that is what I see there), but the C library thinks it has 0x0A (i.e. 10) characters (second row, starting with 0A). And the string itself contains two-byte sequences for some characters (the ø is encoded as C3 B8, and so on...).

The same is true for the hun and per variables.
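For comparison, counting code points instead of bytes does give 7. Here is a minimal sketch, assuming the input is valid UTF-8 (it simply skips continuation bytes, which have the form 10xxxxxx):

#include <stddef.h>

/* Count UTF-8 code points rather than bytes by skipping
 * continuation bytes (10xxxxxx). Assumes valid UTF-8 input. */
size_t utf8_strlen(const char* s)
{
    size_t count = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    return count;
}

With this, utf8_strlen("BBøæåBB") returns 7 while strlen returns 10.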

I did the same experiment with Unicode, the following is the application:

#include <stdio.h>
#include <stddef.h>
#include <wchar.h>

void write_to_file(FILE* fp, const wchar_t* c, size_t len)
{
    /* write the length first, then len wide characters */
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(c, sizeof(wchar_t), len, fp);
}

int main()
{
    FILE* fp = fopen("test.cod", "wb+");
    const wchar_t* ABCDE = L"ABCDE";
    write_to_file(fp, ABCDE, wcslen(ABCDE) );
    const wchar_t* nor = L"BBøæåBB";
    write_to_file(fp, nor, wcslen(nor));
    const wchar_t* hun = L"AAőűéáöüúBB";
    write_to_file(fp, hun, wcslen(hun));
    const wchar_t* per = L"CCبﺙگCC";
    write_to_file(fp, per, wcslen(per));
    fclose(fp);
}

The results here are the expected ones: 5 for the length of ABCDE, 7 for the length of BBøæåBB, and so on, with 4 bytes per character...

[hex dump of the wchar_t* output]
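Note that the 4 bytes per character seen here is itself platform-specific: wchar_t is 4 bytes with glibc on Linux but only 2 bytes (UTF-16 code units) on Windows. A quick check:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* prints 4 with glibc on Linux, 2 on Windows */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}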

So here comes the question: what encoding does the standard C library use, how far can it be trusted when developing portable applications (i.e. will what I write out on one platform be read back correctly on another?), and what other recommendations are there, considering what was presented above?

asked Dec 20 '13 by Ferenc Deak



1 Answer

As far as I know, the standard C library does no encoding at all. I suppose your source file in the first case is saved as UTF-8, so your string constants end up as UTF-8 byte sequences in the compiled code. That is why you get a string with a length of 10 bytes: each of the three non-ASCII characters occupies two bytes.
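You can verify this by dumping the bytes of the literal yourself; the sketch below assumes the source file is saved as UTF-8:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char* nor = "BBøæåBB";
    /* with a UTF-8 source file this prints 10 bytes:
     * 42 42 C3 B8 C3 A6 C3 A5 42 42 */
    for (size_t i = 0; i < strlen(nor); ++i)
        printf("%02X ", (unsigned char)nor[i]);
    printf("\n");
    return 0;
}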

fwrite takes an (untyped) byte buffer as argument. Since it knows nothing about the bytes it processes, it cannot perform any encoding conversion here.

Regarding portability, you should be more careful about things like integer widths: size_t is 8 bytes on a 64-bit platform but 4 bytes on a 32-bit one, so fwrite(&len, sizeof(size_t), 1, fp) can yield different results on different platforms, possibly causing your file to be read back incorrectly. Also (especially with multi-byte encodings) you have to be careful with the platform's endianness.
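One way around both issues is to write the length with a fixed width and a fixed byte order. A minimal sketch (write_len_le64 is a name made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write a length as a fixed-width,
 * little-endian 64-bit value, independent of the platform's
 * native size_t width and endianness. */
int write_len_le64(FILE* fp, uint64_t len)
{
    unsigned char buf[8];
    for (int i = 0; i < 8; ++i)
        buf[i] = (unsigned char)((len >> (8 * i)) & 0xFF);
    return fwrite(buf, 1, sizeof buf, fp) == sizeof buf;
}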

For anything else, you can be sure that your standard library will put the bytes on disk exactly as you pass them, but when processing them as text, you have to make sure that you use the same encoding on all platforms.
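The reading side then reassembles the value the same way, so a file written on one platform reads back correctly on another (read_len_le64 mirrors the hypothetical writer above):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical counterpart: read a little-endian 64-bit length
 * back, regardless of the reader's native endianness. */
int read_len_le64(FILE* fp, uint64_t* len)
{
    unsigned char buf[8];
    if (fread(buf, 1, sizeof buf, fp) != sizeof buf)
        return 0;
    *len = 0;
    for (int i = 0; i < 8; ++i)
        *len |= (uint64_t)buf[i] << (8 * i);
    return 1;
}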

answered Oct 16 '22 by user1781290