For my pet project I am experimenting with string representations, but I arrived at some troubling results. Firstly, here is a short application:
#include <stdio.h>
#include <stddef.h>
#include <string.h>

void write_to_file(FILE* fp, const char* c, size_t len)
{
    void* t = (void*)c;
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(t, len, sizeof(char), fp);
}

int main()
{
    FILE* fp = fopen("test.cod", "wb+");

    const char* ABCDE = "ABCDE";
    write_to_file(fp, ABCDE, strlen(ABCDE));

    const char* nor = "BBøæåBB";
    write_to_file(fp, nor, strlen(nor));

    const char* hun = "AAőűéáöüúBB";
    write_to_file(fp, hun, strlen(hun));

    const char* per = "CCبﺙگCC";
    write_to_file(fp, per, strlen(per));

    fclose(fp);
}
It does nothing special: it just takes a string and writes its length and the string itself to a file. Now, the file, when viewed as hex, looks like:
I am happy with the first result, 5 (the first 8 bytes; I'm on a 64-bit machine), as expected. However, the nor variable in my expectation has 7 characters (since that is what I see there), but the C library thinks it has 0x0A (i.e. 10) characters (second row, starting with 0A followed by 8 more bytes). And the string itself contains double characters (the ø is encoded as C3 B8, and so on). The same is true for the hun and per variables.
I did the same experiment with Unicode, the following is the application:
#include <stdio.h>
#include <stddef.h>
#include <string.h>

void write_to_file(FILE* fp, const wchar_t* c, size_t len)
{
    void* t = (void*)c;
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(t, len, sizeof(wchar_t), fp);
}

int main()
{
    FILE* fp = fopen("test.cod", "wb+");

    const wchar_t* ABCDE = L"ABCDE";
    write_to_file(fp, ABCDE, wcslen(ABCDE));

    const wchar_t* nor = L"BBøæåBB";
    write_to_file(fp, nor, wcslen(nor));

    const wchar_t* hun = L"AAőűéáöüúBB";
    write_to_file(fp, hun, wcslen(hun));

    const wchar_t* per = L"CCبﺙگCC";
    write_to_file(fp, per, wcslen(per));

    fclose(fp);
}
The results here are the expected ones: 5 for the length of ABCDE, 7 for the length of BBøæåBB, and so on, 4 bytes per character...
So here comes the question: what encoding does the standard C library use, and how trustworthy is it when developing portable applications (i.e. will what I write out on one platform be read back correctly on another)? And what other recommendations would you make, considering what was presented above?
Most C string library routines still work with UTF-8, since they only scan for terminating NUL characters.
CPP's very first action, before it even looks for line boundaries, is to convert the file into the character set it uses for internal processing. That set is what the C standard calls the source character set. It must be isomorphic with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of Unicode.
The latest C standard (C11) allows multi-national Unicode characters to be embedded portably within C source text by using \uXXXX or \UXXXXXXXX encoding (where the X denotes a hexadecimal character), although this feature is not yet widely implemented.
As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need.
As far as I know, the standard C library does no encoding conversion at all. I suppose your source file in the first case is encoded as UTF-8, so your string constants end up as UTF-8 constants in the compiled code. That is why you get a string with a length of 10 chars (bytes).
fwrite takes an (untyped) byte array as argument. Since it does not know anything about the bytes it processes, it cannot do any encoding conversion here.
Regarding portability, you should be more careful about things like type sizes: fwrite(&len, sizeof(size_t), 1, fp) can yield different results on different platforms (size_t may be 4 or 8 bytes), possibly causing your file to be read back incorrectly. Also, especially with multi-byte values, you have to be careful about the platform's endianness.
For anything else, you can be sure that your standard library will put the bytes on disk exactly as you pass them; but when processing them as text, you have to make sure you use the same encoding on all platforms.