
Writing binary files using C++: does the default locale matter?

I have code that manipulates binary files using fstream with the binary flag set and using the unformatted I/O functions read and write. This works correctly on all systems I've ever used (the bits in the file are exactly as expected), but those are basically all U.S. English. I have been wondering about the potential for these bytes to be modified by a codecvt on a different system.

It sounds like the standard says unformatted I/O behaves the same as putting characters into the streambuf using sputc/sgetc. These lead to the streambuf's overflow or underflow functions being called, and it sounds like those pass the data through a codecvt facet (e.g., see 27.8.1.4.3 in the C++ standard). For basic_filebuf the creation of this codecvt is specified in 27.8.1.1.5. This makes it look like the results will depend on what basic_filebuf::getloc() returns.

So, my question is, can I assume that a character array written out using ofstream.write on one system can be recovered verbatim using ifstream.read on another system, no matter what locale configuration either person might be using on their system? I would make the following assumptions:

  1. The program is using the default locale (i.e., the program is not changing the locale settings itself at all).
  2. The systems both have CHAR_BIT 8, have the same bit order within each byte, store files as octets, etc.
  3. The stream objects have the binary flag set.
  4. We don't need to worry about any endianness differences at this stage. If any bytes in the array are to be interpreted as a multi-byte value, endianness conversions will be handled as required at a later stage.

If the default locale isn't guaranteed to pass through this stuff unmodified on some system configuration (I don't know, Arabic or something), then what is the best way to write binary files using C++?

TheScottMachine asked Dec 02 '09 08:12


2 Answers

If you have the binary flag set, everything you write is written to the file verbatim, with no conversions. How you interpret the bytes is up to you (and possibly the locale).
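One way to check this claim on a particular system is a round-trip test. The following is a minimal sketch, not a definitive test suite: the helper names `round_trips_verbatim` and `all_byte_values` and the file path passed in are hypothetical, chosen only for illustration. It writes every possible byte value with `write`, reads the file back with `read`, and compares.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: writes `bytes` to `path` with unformatted I/O,
// reads the file back, and reports whether the contents are identical.
bool round_trips_verbatim(const std::string& path, const std::vector<char>& bytes) {
    {
        std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
        if (!out) return false;
    }  // the stream is flushed and closed when it goes out of scope

    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    std::vector<char> read_back(bytes.size());
    in.read(read_back.data(), static_cast<std::streamsize>(read_back.size()));
    if (in.gcount() != static_cast<std::streamsize>(bytes.size())) return false;
    // The file must not contain anything beyond what was written.
    if (in.peek() != std::ifstream::traits_type::eof()) return false;
    return read_back == bytes;
}

// Hypothetical helper: every possible byte value, including 0x00,
// '\n', '\r' and 0x1A, which text-mode streams may treat specially.
std::vector<char> all_byte_values() {
    std::vector<char> bytes;
    for (int v = 0; v < 256; ++v) {
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(v)));
    }
    return bytes;
}
```

Calling `round_trips_verbatim("roundtrip.bin", all_byte_values())` on each target system is a cheap sanity check. It only covers the machine it runs on, so it supplements the guarantees discussed here rather than replacing them.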

One more thing: breakage across locales is still possible. If, for example, your data source produces binary data whose format depends on the locale (a bad idea, by the way), loading that data on a machine with a different locale will cause trouble. That is a design error, though.

If you just use standard data types/structures that have the same format/layout no matter what locale they were created in, everything should be OK.

Stan answered Sep 25 '22 05:09


Thanks for the help. I just thought it might be helpful to post some additional information about this that wouldn't fit in a comment.

At startup, the default locale for a C++ program is always the "C" locale (http://www.cplusplus.com/reference/clibrary/clocale/setlocale/). If this is the only locale used in your program, the behaviour doesn't depend on the particular locale configuration of the machine that it's running on. It also means that unformatted I/O for a char does not undergo any code conversion (wchar_t might be a different story though). This means that (given the assumptions in the question) read and write should allow binary data to be recovered unmodified.
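The "no code conversion for char" point can be checked directly. The sketch below asks a locale's `std::codecvt<char, char, std::mbstate_t>` facet whether it ever converts, and checks which locale a freshly constructed file stream holds. The function names are hypothetical, and the second check assumes the program has not changed the global locale.

```cpp
#include <cwchar>   // std::mbstate_t
#include <fstream>
#include <locale>

// Hypothetical helper: true if the char-to-char conversion facet of
// `loc` reports that it never performs any conversion.
bool char_io_is_unconverted(const std::locale& loc) {
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    return std::use_facet<facet_type>(loc).always_noconv();
}

// Hypothetical helper: true if a newly constructed file stream holds
// the classic "C" locale, i.e. the global locale is still the default.
bool new_stream_uses_classic_locale() {
    std::ofstream stream;  // not opened; we only inspect its locale
    return stream.getloc() == std::locale::classic();
}
```

In a program that never touches the locale, both `char_io_is_unconverted(std::locale::classic())` and `new_stream_uses_classic_locale()` should return true. Streams of `wchar_t` use a different facet, `std::codecvt<wchar_t, char, std::mbstate_t>`, which this check does not cover.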

(From reading the documentation.) You can globally set the application's locale to match the system default by calling std::locale::global(std::locale("")), which means streams constructed from that point on will use the system default locale. Note that calling setlocale(LC_ALL, "") on its own only changes the C library's locale; it does not change the global C++ locale that newly constructed streams copy. To set it back to the "C" locale you can call std::locale::global(std::locale::classic()), which means this is what streams constructed in the future will use. You can also specify that the "C" locale should be used for a stream that's already constructed by calling stream.imbue(std::locale::classic()).

TheScottMachine answered Sep 25 '22 05:09