I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:
"草鷗外.gif"
which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif
If I create a file from this filename, I can't open it because I get a FileNotFound exception. Even using this on the folder containing the file will fail:
File[] files = folder.listFiles();
for (File file : files) {
if (!file.exists()) {
System.out.println("Failed to find File"); //Fails on the surrogate filename
}
}
Most of the code I am actually dealing with are of the form:
FileInputStream instream = new FileInputStream(new File("草鷗外.gif"));
// operations follow
Is there some way I can address this problem, either escaping the filenames or opening files differently?
Unicode sequences can be used everywhere in Java code. As long as it contains Unicode characters, it can be used as an identifier. You may use Unicode to convey comments, ids, character content, and string literals, as well as other information. However, note that they are interpreted by the compiler early.
Unicode is a computing industry standard designed to consistently and uniquely encode characters used in written languages throughout the world. The Unicode standard uses hexadecimal to express a character. For example, the value 0x0041 represents the Latin character A.
Unicode is a 16-bit character encoding system. The lowest value is \u0000 and the highest value is \uFFFF. UTF-8 is a variable width character encoding. UTF-8 has the ability to be as condensed as ASCII but can also contain any Unicode characters with some increase in the size of the file.
In computing, UTF-16 (16-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding form maps each character to a sequence of 16-bit words. Characters are known as code points and the 16-bit words are known as code units.
I suspect one of Java or Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (which is a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither Mac nor Java here to test with.
“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:
\xF0\xA6\xBF\xB6
it outputs a UTF-8-encoded sequence for each of the surrogates:
\xED\xA1\x9B\xED\xBF\xB6
This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. Problem is if you round-trip that through a real UTF-8 encoder you've got a different string, the four-byte one above. Try to access the file with that name and boom! fail.
So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:
$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:
['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']
which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:
['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']
it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the ‘proper’ version?:
os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With