For my pet project I am experimenting with string representations, but I arrived at some troubling results. Firstly, here is a short application:
#include <stdio.h>
#include <stddef.h>
#include <string.h>

void write_to_file(FILE* fp, const char* c, size_t len)
{
    void* t = (void*)c;
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(t, len, sizeof(char), fp);
}

int main()
{
    FILE* fp = fopen("test.cod", "wb+");

    const char* ABCDE = "ABCDE";
    write_to_file(fp, ABCDE, strlen(ABCDE));

    const char* nor = "BBøæåBB";
    write_to_file(fp, nor, strlen(nor));

    const char* hun = "AAőűéáöüúBB";
    write_to_file(fp, hun, strlen(hun));

    const char* per = "CCبﺙگCC";
    write_to_file(fp, per, strlen(per));

    fclose(fp);
}
It does nothing special: it just takes a string and writes its length and the string itself to a file. Now, the file, when viewed as hex, looks like:
I am happy with the first result, 5 (the first 8 bytes; I'm on a 64-bit machine), as expected. However, the nor variable in my expectation has 7 characters (since that is what I see there), but the C library thinks it has 0x0A (i.e. 10) characters (second row, starting with 0A followed by 8 more bytes). And the string itself contains double characters (the ø is encoded as C3 B8, and so on). The same is true for the hun and per variables.
I did the same experiment with Unicode, the following is the application:
#include <stdio.h>
#include <stddef.h>
#include <string.h>

void write_to_file(FILE* fp, const wchar_t* c, size_t len)
{
    void* t = (void*)c;
    fwrite(&len, sizeof(size_t), 1, fp);
    fwrite(t, len, sizeof(wchar_t), fp);
}

int main()
{
    FILE* fp = fopen("test.cod", "wb+");

    const wchar_t* ABCDE = L"ABCDE";
    write_to_file(fp, ABCDE, wcslen(ABCDE));

    const wchar_t* nor = L"BBøæåBB";
    write_to_file(fp, nor, wcslen(nor));

    const wchar_t* hun = L"AAőűéáöüúBB";
    write_to_file(fp, hun, wcslen(hun));

    const wchar_t* per = L"CCبﺙگCC";
    write_to_file(fp, per, wcslen(per));

    fclose(fp);
}
The results here are the expected ones: 5 for the length of ABCDE, 7 for the length of BBøæåBB, and so on, 4 bytes per character...
So here comes the question: what encoding does the standard C library use, and how trustworthy is it when developing portable applications (i.e. will what I write out on one platform be read back correctly on another)? And what other recommendations would you make, considering what was presented above?
Most C string library routines still work with UTF-8, since they only scan for terminating NUL characters.
CPP's very first action, before it even looks for line boundaries, is to convert the file into the character set it uses for internal processing. That set is what the C standard calls the source character set. It must be isomorphic with ISO 10646, also known as Unicode. CPP uses the UTF-8 encoding of Unicode.
The latest C standard (C11) allows multi-national Unicode characters to be embedded portably within C source text by using \uXXXX or \UXXXXXXXX encoding (where the X denotes a hexadecimal character), although this feature is not yet widely implemented.
As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need.
As far as I know, the standard C library does no encoding conversion at all. I suppose your source file in the first case is encoded as UTF-8, so your string constants end up as UTF-8 constants in the compiled code. That is why you get a string with a length of 10 chars (bytes).
fwrite takes an (untyped) byte array as argument. Since it does not know anything about the bytes it processes, it cannot do any encoding conversion here.
Regarding portability, you should be more careful about things like type sizes: fwrite(&len, sizeof(size_t), 1, fp) can yield different results on different platforms (size_t may be 4 or 8 bytes), possibly causing your file to be read back incorrectly. Also, especially with multi-byte values, you have to be careful about the platform's endianness.
For anything else, you can be sure that your standard library will put the bytes on disk exactly as you pass them; but when processing them as text, you have to make sure you use the same encoding on all platforms.