
Windows Codepage Interactions with Standard C/C++ filenames?

A customer is complaining that our code used to write files with Japanese characters in the filename but no longer does so in all cases. We have always just used good old char * strings to represent filenames, so it came as a bit of a shock to me that it ever worked, and we haven't done anything I am aware of that should have made it stop working. I had them send me a file exported from our software with a filename embedded in it, and it looks like the strings use bytes 0x82 and 0x83 as the first byte of a double-byte sequence to represent the Japanese characters. Poking around online leads me to believe this is probably Shift-JIS and/or Windows code page 932.

It looks to me like previously both fopen and ofstream::open accepted filenames in this code page, and now only fopen does. I've checked the Visual Studio fopen docs, and I see no hint of what makes an acceptable string to pass to fopen.

In the short run, I'm hoping someone can shed some light on the specific Windows fopen versus ofstream::open issue for me. In the long run, I'd really like to know the accepted way of opening Unicode (and other?) filenames in C++, on Windows, Linux, and OS X.

Edited to add: I believe that the opens that work are done in the "C" locale, whereas the ones that do not work are done in whatever the customer's default locale is. However, that has been the case for years now, and the old version of the program still works today on their system, so this seems a longshot for explaining the issue we are seeing.

Update: I sent off a small test program to the customer. It has verified that fopen works fine with the SHIFT_JIS filename, and std::ofstream does not. This is in Visual Studio 2005, and happened regardless of whether I used the default locale or the "C" locale.

I'm still interested if anyone has an explanation for this behavior (and why it mysteriously changed -- perhaps a service pack of VS2005?) and hoping to put together a comprehensive "best practices" for handling Unicode filenames in portable C++ code.

Sol asked Jan 26 '09 18:01




2 Answers

Functions like fopen or ofstream::open take the file name as char *, but that is interpreted as being in the system code page.

This means the name can contain Japanese characters encoded as Shift-JIS (cp932), Simplified Chinese (GBK/cp936), Korean, Arabic, Russian, you name it, as long as the encoding matches the OS system code page.

It also means that the application can use Japanese file names on a Japanese system only. Change the system code page and the application "stops working". I suspect this is what happened here (there have been no big changes in Windows in this area since Windows 2000).

This is how you change the system code page: http://www.mihai-nita.net/article.php?artID=20050611a
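To check whether a stored filename really looks like cp932, as the 0x82 and 0x83 bytes in the question suggest, something like the following sketch could help. The function names are made up for illustration, and the byte ranges are the cp932 lead-byte ranges (0x81-0x9F, 0xE0-0xFC) and trail-byte ranges (0x40-0x7E, 0x80-0xFC). It only checks the byte structure, not whether each pair maps to an assigned character.

```cpp
#include <cstddef>
#include <string>

// Lead-byte ranges of Windows code page 932 (Shift-JIS).
bool is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Trail-byte ranges of Windows code page 932.
bool is_cp932_trail_byte(unsigned char b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Hypothetical helper: counts structurally valid double-byte sequences.
// Returns -1 if a lead byte is not followed by a valid trail byte.
int count_cp932_double_bytes(const std::string& name) {
    int count = 0;
    for (std::size_t i = 0; i < name.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(name[i]);
        if (!is_cp932_lead_byte(b)) {
            continue;  // treated as a single-byte character
        }
        if (i + 1 >= name.size()) {
            return -1;  // lead byte at the very end of the string
        }
        if (!is_cp932_trail_byte(static_cast<unsigned char>(name[i + 1]))) {
            return -1;  // lead byte followed by an invalid trail byte
        }
        ++count;
        ++i;  // skip the trail byte we just consumed
    }
    return count;
}
```

A result greater than zero for a name that is supposed to be plain ASCII is a hint that the string carries code-page-dependent bytes and will only open correctly on a system whose code page matches.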

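In the long run you might consider moving to Unicode, using _wfopen and the wchar_t filename overloads that Visual C++ provides for ofstream as a non-standard extension. Note that wofstream by itself only changes the character type of the stream contents, not how the filename is interpreted.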
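A minimal sketch of the Unicode route, assuming the application keeps its filenames in UTF-8 internally (that choice is an assumption, not something stated in the question). The helper names open_utf8 and utf8_to_wide are hypothetical. On Windows the name is converted to UTF-16 with MultiByteToWideChar and passed to _wfopen; elsewhere the bytes go straight to fopen, which works because Linux and OS X treat filenames as byte strings and UTF-8 is the usual convention there.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Convert UTF-8 to UTF-16 with the Win32 API.
// Returns an empty string on failure or for empty input.
static std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) {
        return std::wstring();
    }
    const int in_len = static_cast<int>(utf8.size());
    const int out_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), in_len, nullptr, 0);
    if (out_len <= 0) {
        return std::wstring();
    }
    std::wstring wide(static_cast<std::size_t>(out_len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), in_len, &wide[0], out_len);
    return wide;
}
#endif

// Hypothetical helper: opens a file whose name is given in UTF-8.
// Returns a null pointer on failure, like fopen.
std::FILE* open_utf8(const std::string& name, const char* mode) {
#ifdef _WIN32
    // The wide-character call does not depend on the system code page.
    const std::wstring wname = utf8_to_wide(name);
    const std::wstring wmode = utf8_to_wide(mode);
    if (wname.empty() || wmode.empty()) {
        return nullptr;
    }
    return _wfopen(wname.c_str(), wmode.c_str());
#else
    // Pass the UTF-8 bytes through unchanged.
    return std::fopen(name.c_str(), mode);
#endif
}
```

With a wrapper like this, the rest of the code never hands a code-page-dependent char * to the runtime, so changing the system code page should no longer affect which files can be opened.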

Mihai Nita answered Oct 23 '22 03:10


I'm not aware of any portable way of opening files with Unicode filenames using only the standard libraries. But there are some frameworks that provide portable functions, for example:

  • for C: glib uses filenames in UTF-8;
  • for C++: glibmm also uses filenames in UTF-8, requires glib;
  • for C++: boost can use wstring for filenames.

I'm pretty sure the .NET/Mono frameworks also contain portable filesystem functions, but I don't know them.

Tometzky answered Oct 23 '22 05:10