
Unicode filenames on FAT-32?

As far as I understand, NTFS supports Unicode filenames (UTF-16, as Microsoft claims?).

But the official MSDN documentation is very vague about which codepage(s) are used to store filenames (file paths) on FAT-32.

Here it says that the OEM code page (CP437, I assume) is used to store filenames: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748.aspx

But here it turns out that there can be different OEM codepages with CP437 being one of them: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752.aspx

And we all know that utilities like mount support many more codepages for FAT than just the set of OEM codepages.

So what is the actual codepage for FAT-32 filenames? It depends on the system codepage at the time when FAT volume was created? Can FAT support true Double Byte Character Set codepages like UTF-16? Or Multi Byte Character Set codepages like UTF-8 is the limit?

And more specific question: What happens when I use CreateFileW function (which, as MSDN states, uses UTF-16 as filename codepage) to create a file on FAT-32 volume?

jake.libber asked Oct 21 '13 20:10



1 Answer

You might have to experiment here. This is a great question, and I'm not 100% confident, but:

So what is the actual codepage for FAT-32 filenames? It depends on the system codepage at the time when FAT volume was created?

The "OEM codepage", whatever that is for the system writing the filename. As far as I know, it is not recorded anywhere on the volume itself.

Can FAT support true Double Byte Character Set codepages like UTF-16? Or Multi Byte Character Set codepages like UTF-8 is the limit?

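No, I don't believe FAT is directly capable of either UTF-16 or UTF-8. That said, Microsoft stores the Unicode filename out of band, in extra directory entries (the VFAT long filename extension), encoded as UTF-16. A file thus has two filenames: an 8.3 short name in the OEM codepage and a Unicode long name. (This is also how you can have filenames longer than 8.3 characters.)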
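As a rough illustration of the out-of-band storage: to my knowledge, each VFAT long-name directory entry holds 13 UTF-16 code units, so the number of extra entries grows with the name length. The helper name `lfn_entry_count` below is made up for this sketch, and it only counts entries; it does not build the on-disk structures.

```python
import math

# Assumption: one VFAT long-name directory entry carries 13 UTF-16 code units.
CHARS_PER_LFN_ENTRY = 13

def lfn_entry_count(name: str) -> int:
    """Return how many long-name entries a filename would occupy (hypothetical helper)."""
    # Count UTF-16 code units, not Python characters, since the long name is UTF-16.
    code_units = len(name.encode("utf-16-le")) // 2
    return math.ceil(code_units / CHARS_PER_LFN_ENTRY)

print(lfn_entry_count("asdfΣ.txt"))  # short name, fits in a single entry
print(lfn_entry_count("a" * 255))    # a name at the 255-character limit
```

These entries sit alongside the regular 8.3 entry, which is why a file on FAT ends up with two names.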

And more specific question: What happens when I use CreateFileW function (which, as MSDN states, uses UTF-16 as filename codepage) to create a file on FAT-32 volume?

The Unicode filename, as passed to CreateFileW, is stored directly in the out of band (long) filename. It is also re-encoded into the OEM codepage (whatever that happens to be on the system), and that version is stored as the regular 8.3 short filename. If the name cannot be converted into the OEM codepage, or does not fit in 8.3 characters, Windows will give the file a short name like FILENA~1.TXT.
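The decision described above can be approximated in a few lines. This is a simplified sketch, not Windows' actual algorithm: the function name `short_name_needs_alias` is hypothetical, CP437 is assumed as the OEM codepage, and characters that are illegal in short names (spaces, `+`, `,` and so on) are ignored.

```python
def short_name_needs_alias(name: str, oem: str = "cp437") -> bool:
    """Return True if `name` cannot serve as its own 8.3 short name (simplified)."""
    base, dot, ext = name.rpartition(".")
    if not dot:                      # no extension at all
        base, ext = name, ""
    if "." in base or not base:      # extra dots or an empty base never fit 8.3
        return True
    try:
        # Short names are stored upper-cased in the OEM codepage.
        base_bytes = base.upper().encode(oem)
        ext_bytes = ext.upper().encode(oem)
    except UnicodeEncodeError:
        return True                  # some character has no OEM representation
    return len(base_bytes) > 8 or len(ext_bytes) > 3

for candidate in ["asdfΣ.txt", "asdfΛ.txt", "filename_long.txt"]:
    print(candidate, short_name_needs_alias(candidate))
```

When the function returns True, the real system falls back to a generated name of the FILENA~1.TXT kind; the exact numbering scheme is not modelled here.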

Some citations for these answers:

First, this page tells us that the OEM code page != the Windows code page:

Non-Unicode applications that create FAT files sometimes have to use the standard C runtime library conversion functions to translate between the Windows code page character set and the OEM code page character set. With Unicode implementations of the file system functions, it is not necessary to perform such translations.

On a typical American system, the OEM code page is "CP437", but the Windows code page is Windows-1252. (The FooA calls, I believe, use the Windows code page, which is typically Windows-1252 on an American machine but depends on the locale.)
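To see why the translation between the two code pages matters, here is a small sketch using Python's built-in `cp437` and `cp1252` codecs as stand-ins for the OEM and Windows code pages of an American system. The same byte means a different character in each, and the same character maps to a different byte.

```python
raw = b"\xe4"

# One byte, two interpretations.
as_oem = raw.decode("cp437")    # OEM code page view
as_ansi = raw.decode("cp1252")  # Windows ("ANSI") code page view
print(as_oem, as_ansi)

# One character, two byte values.
oem_bytes = "ä".encode("cp437")
ansi_bytes = "ä".encode("cp1252")
print(oem_bytes, ansi_bytes)
```

A non-Unicode program that writes Windows-1252 bytes into an OEM short name without converting would therefore produce the wrong character.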

If you have a FAT volume available, you can see this in action. The character "Σ" (U+03A3) is not present in Windows-1252; it is, however, in CP437. You can see both the short and long filenames with dir /X. With a file named asdfΣ.txt, you'll see:

ASDFΣ.TXT    asdfΣ.txt

However, with a file named "asdfΛ.txt" (Λ is not present in either CP437 or Windows-1252), you'll see:

ASDF~1.TXT   asdf?.txt

(You'll likely see ?, because cmd.exe's font cannot display a Λ.)
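If you don't have a FAT volume handy, the code page membership of the two characters above can be checked with Python's `cp437` and `cp1252` codecs. This only tests whether an encoding exists; it says nothing about what Windows actually writes to disk. The helper name `encodable` is made up for this sketch.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` has a representation in the given code page."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for ch in "ΣΛ":
    print(ch, "cp437:", encodable(ch, "cp437"), "cp1252:", encodable(ch, "cp1252"))
```

Σ should come out as present in CP437 only and Λ as present in neither, which matches the short names shown by dir /X.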

For information about long filenames, see this Wikipedia article.

Also, interestingly, if you name a file "asdf©.txt", you might get:

ASDFC.TXT    asdfc.txt

… I'm not 100% sure here, but I think Windows cleverly decided to substitute "c" for ©, and did likewise for displaying it. If you change the font to something not raster based, like Consolas, you'll see:

ASDFC.TXT    asdf©.txt

And this is why you should use the FooW functions.

Thanatos answered Sep 22 '22 06:09