Should the file opening interface in a C++ library use UTF-8 on Windows?

I'm working on a library (pugixml) that, among other things, provides a file load/save API for XML documents using narrow-character C strings:

bool load_file(const char* path);
bool save_file(const char* path);

Currently the path is passed verbatim to fopen, which means that on Linux/OSX you can pass a UTF-8 string to open the file (or any other byte sequence that is a valid path), but on Windows you have to use the Windows ANSI encoding - UTF-8 won't work.

The document data is (by default) represented using UTF-8, so if you had an XML document with a file path in it, you would not be able to pass the path retrieved from the document to the load_file function as is - or rather, this would not work on Windows. The library provides alternative functions that use wchar_t:

bool load_file(const wchar_t* path);

But using them requires the extra effort of converting UTF-8 to wchar_t.
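
For example, with C++11's &lt;codecvt&gt; facilities, the caller-side conversion on Windows looks roughly like this (a sketch; the load_utf8_path helper name is made up for illustration):

#include &lt;codecvt&gt;
#include &lt;locale&gt;
#include &lt;string&gt;
#include "pugixml.hpp"

// Sketch of the conversion a caller has to do today on Windows when the
// path arrives as UTF-8 (e.g. extracted from another XML document).
bool load_utf8_path(pugi::xml_document&amp; doc, const std::string&amp; utf8_path)
{
    // On Windows wchar_t is 16-bit, so this converts UTF-8 to UTF-16.
    std::wstring_convert&lt;std::codecvt_utf8_utf16&lt;wchar_t&gt; &gt; conv;
    std::wstring wide_path = conv.from_bytes(utf8_path);
    return doc.load_file(wide_path.c_str());
}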

A different approach (that is used by SQLite and GDAL - not sure if there are other C/C++ libraries that do that) involves treating the path as UTF-8 on Windows (which would be implemented by converting it to UTF-16 and using a wchar_t-aware function like _wfopen to open the file).
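
For reference, a minimal sketch of how such a wrapper could be implemented (the open_file_utf8 name is invented for illustration, and error handling is simplified):

#include &lt;cstdio&gt;
#include &lt;cstring&gt;
#include &lt;string&gt;

#ifdef _WIN32
#include &lt;windows.h&gt;
#endif

// Open a file whose path is encoded as UTF-8 on every platform.
FILE* open_file_utf8(const char* path, const char* mode)
{
#ifdef _WIN32
    // Ask for the required UTF-16 buffer size; it includes the terminator
    // because we pass -1 as the input length.
    int size = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    if (size == 0) return NULL; // path is not valid UTF-8

    std::wstring wpath(size, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, path, -1, &amp;wpath[0], size);

    // fopen mode strings are ASCII, so widening char-by-char is enough.
    std::wstring wmode(mode, mode + std::strlen(mode));

    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // On Linux/OSX the path bytes are passed through unchanged.
    return std::fopen(path, mode);
#endif
}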

There are different pros and cons that I can see and I'm not sure which tradeoff is best.

On one hand, using a consistent encoding on all platforms is definitely good. This would mean that you can use file paths extracted from an XML document to open other XML documents. Also, if the application that uses the library adopts UTF-8, it does not have to do extra conversions when opening XML files through the library.

On the other hand, this means that the behavior of file loading is no longer the same as that of the standard functions - so file access through the library is not equivalent to file access through standard fopen/std::fstream. It seems that while some libraries take the UTF-8 path, this is largely an unpopular choice (is this true?), so given an application that uses many third-party libraries, it may increase confusion instead of helping developers.

For example, passing argv[1] into load_file currently works for paths encoded using the system locale encoding on Windows (e.g. with a Russian locale you can load files with Russian names that way, but you won't be able to load files with Japanese characters). Switching to UTF-8 would mean that only ASCII paths work unless you retrieve the command-line arguments in some other, Windows-specific way.
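
(On Windows that would mean something along these lines - a sketch using CommandLineToArgvW; the utf8_args name is invented for illustration, and you need to link against Shell32:)

#include &lt;windows.h&gt;
#include &lt;shellapi.h&gt; // CommandLineToArgvW, link with Shell32.lib
#include &lt;string&gt;
#include &lt;vector&gt;

// Fetch the command line as UTF-16 and re-encode each argument as UTF-8,
// bypassing the locale-encoded argv that main() receives.
std::vector&lt;std::string&gt; utf8_args()
{
    int argc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &amp;argc);

    std::vector&lt;std::string&gt; args;
    if (!wargv) return args;

    for (int i = 0; i &lt; argc; ++i)
    {
        int size = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, NULL, 0, NULL, NULL);
        if (size == 0) continue; // skip unconvertible argument

        std::string arg(size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, &amp;arg[0], size, NULL, NULL);
        arg.resize(size - 1); // drop the embedded null terminator
        args.push_back(arg);
    }

    LocalFree(wargv);
    return args;
}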

And of course this would be a breaking change for some users of the library.

Am I missing any important points here? Are there other libraries that take the same approach? What is better for C++ - being consistently inconsistent in file access, or striving for uniform cross-platform behavior?

Note that the question is about the default way to open files - of course nothing prevents me from adding another pair of functions with a _utf8 suffix or indicating the path encoding in some other way.

asked Jun 27 '15 by zeuxcg




1 Answer

There's a growing belief that cross-platform code should aim for UTF-8 only, performing conversions automatically on Windows where appropriate. The utf8everywhere manifesto gives a good rundown of the reasons to prefer UTF-8 encoding.

As a recent example, libtorrent deprecated all the routines that handle wchar_t filenames, and instead asks library users to apply its wchar_t-to-UTF-8 conversion functions before passing in filenames.

Personally, the strongest reason I have to avoid wchar_t/wstring functions is simply to avoid duplicating my API. Keeping the number of functions in the API down reduces maintenance, documentation, and code-path duplication costs; the details can be worked out internally. The mess of duplicated APIs caused by the Windows ANSI/Unicode split is probably lesson enough to avoid this in your own APIs.

answered Nov 07 '22 by nneonneo