Should the file opening interface in a C++ library use UTF-8 on Windows?

I'm working on a library (pugixml) that, among other things, provides a file load/save API for XML documents using narrow-character C strings:

bool load_file(const char* path);
bool save_file(const char* path);

Currently the path is passed verbatim to fopen, which means that on Linux/OSX you can pass a UTF-8 string to open the file (or any other byte sequence that is a valid path), but on Windows you have to use the Windows ANSI encoding - UTF-8 won't work.

The document data is (by default) represented using UTF-8, so if you had an XML document with a file path in it, you would not be able to pass the path retrieved from the document to the load_file function as is - or rather, this would not work on Windows. The library provides alternative functions that use wchar_t:

bool load_file(const wchar_t* path);

But using them requires the extra effort of converting UTF-8 to wchar_t.
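
For example, with C++11's &lt;codecvt&gt; facilities, the caller-side conversion on Windows looks roughly like this (a sketch; the load_utf8_path helper name is made up for illustration):

#include &lt;codecvt&gt;
#include &lt;locale&gt;
#include &lt;string&gt;
#include "pugixml.hpp"

// Sketch of the conversion a caller has to do today on Windows when the
// path arrives as UTF-8 (e.g. extracted from another XML document).
bool load_utf8_path(pugi::xml_document&amp; doc, const std::string&amp; utf8_path)
{
    // On Windows wchar_t is 16-bit, so this converts UTF-8 to UTF-16.
    std::wstring_convert&lt;std::codecvt_utf8_utf16&lt;wchar_t&gt; &gt; conv;
    std::wstring wide_path = conv.from_bytes(utf8_path);
    return doc.load_file(wide_path.c_str());
}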

A different approach (that is used by SQLite and GDAL - not sure if there are other C/C++ libraries that do that) involves treating the path as UTF-8 on Windows (which would be implemented by converting it to UTF-16 and using a wchar_t-aware function like _wfopen to open the file).
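
For reference, a minimal sketch of how such a wrapper could be implemented (the open_file_utf8 name is invented for illustration, and error handling is simplified):

#include &lt;cstdio&gt;
#include &lt;cstring&gt;
#include &lt;string&gt;

#ifdef _WIN32
#include &lt;windows.h&gt;
#endif

// Open a file whose path is encoded as UTF-8 on every platform.
FILE* open_file_utf8(const char* path, const char* mode)
{
#ifdef _WIN32
    // Ask for the required UTF-16 buffer size; it includes the terminator
    // because we pass -1 as the input length.
    int size = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    if (size == 0) return NULL; // path is not valid UTF-8

    std::wstring wpath(size, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, path, -1, &amp;wpath[0], size);

    // fopen mode strings are ASCII, so widening char-by-char is enough.
    std::wstring wmode(mode, mode + std::strlen(mode));

    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // On Linux/OSX the path bytes are passed through unchanged.
    return std::fopen(path, mode);
#endif
}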

There are different pros and cons that I can see and I'm not sure which tradeoff is best.

On one hand, using a consistent encoding on all platforms is definitely good. This would mean that you can use file paths extracted from an XML document to open other XML documents. Also, if the application that uses the library adopts UTF-8, it does not have to do extra conversions when opening XML files through the library.

On the other hand, this means that the behavior of file loading is no longer the same as that of the standard functions - so file access through the library is not equivalent to file access through standard fopen/std::fstream. It seems that while some libraries take the UTF-8 path, this is largely an unpopular choice (is this true?), so given an application that uses many third-party libraries, it may increase confusion instead of helping developers.

For example, passing argv[1] into load_file currently works for paths encoded using the system locale encoding on Windows (e.g. with a Russian locale you can load files with Russian names that way, but you won't be able to load files with Japanese characters). Switching to UTF-8 would mean that only ASCII paths work unless you retrieve the command-line arguments in some other, Windows-specific way.
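
(On Windows that would mean something along these lines - a sketch using CommandLineToArgvW; the utf8_args name is invented for illustration, and you need to link against Shell32:)

#include &lt;windows.h&gt;
#include &lt;shellapi.h&gt; // CommandLineToArgvW, link with Shell32.lib
#include &lt;string&gt;
#include &lt;vector&gt;

// Fetch the command line as UTF-16 and re-encode each argument as UTF-8,
// bypassing the locale-encoded argv that main() receives.
std::vector&lt;std::string&gt; utf8_args()
{
    int argc = 0;
    wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &amp;argc);

    std::vector&lt;std::string&gt; args;
    if (!wargv) return args;

    for (int i = 0; i &lt; argc; ++i)
    {
        int size = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, NULL, 0, NULL, NULL);
        if (size == 0) continue; // skip unconvertible argument

        std::string arg(size, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, &amp;arg[0], size, NULL, NULL);
        arg.resize(size - 1); // drop the embedded null terminator
        args.push_back(arg);
    }

    LocalFree(wargv);
    return args;
}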

And of course this would be a breaking change for some users of the library.

Am I missing any important points here? Are there other libraries that take the same approach? What is better for C++ - being consistently inconsistent in file access, or striving for uniform cross-platform behavior?

Note that the question is about the default way to open files - of course nothing prevents me from adding another pair of functions with a _utf8 suffix or indicating the path encoding in some other way.

asked Jun 27 '15 by zeuxcg




1 Answer

There's a growing belief that cross-platform code should aim for UTF-8 only, performing conversions automatically on Windows where appropriate. The utf8everywhere manifesto gives a good rundown of the reasons to prefer UTF-8 encoding.

As a recent example, libtorrent deprecated all the routines that handle wchar_t filenames, and instead asks library users to apply its wchar_t-to-UTF-8 conversion functions before passing in filenames.

Personally, the strongest reason I have to avoid wchar_t/wstring functions is simply to avoid duplicating my API. Keeping the number of functions in the API down reduces maintenance, documentation, and code-path duplication costs; the details can be worked out internally. The mess of duplicated APIs caused by the Windows ANSI/Unicode split is probably lesson enough to avoid this in your own APIs.

answered Nov 07 '22 by nneonneo