Java Can't Open a File with Surrogate Unicode Values in the Filename?

I'm working on code that performs various I/O operations on files, and I want it to handle international filenames. I'm on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is:

"草𦿶鷗外.gif", which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif

If I create a file from this filename, I can't open it because I get a FileNotFoundException. Even running this on the folder containing the file fails:

File[] files = folder.listFiles(); 
for (File file : files) {
    if (!file.exists()) {
        System.out.println("Failed to find File"); //Fails on the surrogate filename
    }
}

Most of the code I am actually dealing with is of the form:

FileInputStream instream = new FileInputStream(new File("草𦿶鷗外.gif"));
// operations follow

Is there some way I can address this problem, either escaping the filenames or opening files differently?
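For reference, the breakdown of the test filename into Java characters can be reproduced with a small sketch like the one below. It assumes the second character is the supplementary code point U+26FF6 (which is what the surrogate pair \uD85B\uDFF6 decodes to); the class and method names are made up for illustration and are not part of the code in question.

```java
class SurrogateBreakdown {
    // Build the test filename from code points so the source file's own
    // encoding cannot mangle the supplementary character.
    static String buildName() {
        return "\u8349" + new String(Character.toChars(0x26FF6)) + "\u9DD7\u5916.gif";
    }

    // Render every non-ASCII UTF-16 code unit as a four-digit hex escape;
    // ASCII characters are kept as they are.
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = buildName();
        System.out.println(name.length());                         // 9 UTF-16 code units
        System.out.println(name.codePointCount(0, name.length())); // 8 code points
        System.out.println(toEscapes(name));
    }
}
```

Writing the name this way only rules out source-file encoding problems; it does not change how the JVM encodes the name when it talks to the filesystem.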

asked Oct 09 '09 19:10

Bear

1 Answer

I suspect either Java or the Mac is using CESU-8 instead of proper UTF-8. Java uses “modified UTF-8” (a slight variation of CESU-8) for a variety of internal purposes, but I wasn't aware it could use it as a filesystem/defaultCharset. Unfortunately I have neither a Mac nor Java here to test with.

“Modified” is a modified way of saying “badly bugged”. Instead of outputting a four-byte UTF-8 sequence for supplementary (non-BMP) characters like 𦿶:

\xF0\xA6\xBF\xB6

it outputs a UTF-8-encoded sequence for each of the surrogates:

\xED\xA1\x9B\xED\xBF\xB6

This isn't a valid UTF-8 sequence, but a lot of decoders will allow it anyway. The problem is that if you round-trip it through a real UTF-8 encoder, you get a different byte string: the four-byte one above. Try to access the file with that name and boom! fail.
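The two byte sequences above can be reproduced from Java itself. The sketch below is only an illustration, assuming Java 7 or later for StandardCharsets; it uses DataOutputStream.writeUTF, which is documented to write modified UTF-8 behind a two-byte length prefix, as a stand-in for whatever the JVM may be doing internally. It shows the encoding difference, not the actual filesystem code path. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class SurrogateEncodingDemo {
    // Format bytes as \xNN escapes, starting at index 'from'.
    static String hex(byte[] bytes, int from) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < bytes.length; i++) {
            sb.append(String.format("\\x%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Standard UTF-8: one four-byte sequence per supplementary character.
    static String standardUtf8(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8), 0);
    }

    // Modified UTF-8 as written by writeUTF: each surrogate is encoded
    // separately as a three-byte sequence.
    static String modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return hex(buf.toByteArray(), 2); // skip the two-byte length prefix
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x26FF6));
        System.out.println(standardUtf8(s)); // \xF0\xA6\xBF\xB6
        System.out.println(modifiedUtf8(s)); // \xED\xA1\x9B\xED\xBF\xB6
    }
}
```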

So first let's just check how filenames are actually stored under your current filesystem, using a platform that uses bytes for filenames such as Python 2.x:

$ python
Python 2.x.something (blah blah)
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')

On my filesystem (Linux, ext4, UTF-8), the filename “草𦿶鷗外.gif” comes out as:

['\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

which is what you want. If that's what you get, it's probably Java doing it wrong. If you get the longer six-byte-character version:

['\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif']

it's probably OS X doing it wrong... does it always store filenames like this? (Or did the files come from somewhere else originally?) What if you rename the file to the ‘proper’ version?:

os.rename('\xe8\x8d\x89\xed\xa1\x9b\xed\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif', '\xe8\x8d\x89\xf0\xa6\xbf\xb6\xe9\xb7\x97\xe5\xa4\x96.gif')
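It may also help to look at the same directory from the Java side. Java won't show you the raw bytes, but a rough sketch like the following prints the code points Java decoded for each name next to the result of exists(). The class and method names are hypothetical, and how a mis-decoded name shows up is platform-dependent; an unpaired surrogate (U+D800 to U+DFFF) or U+FFFD in the output would be a hint that decoding went wrong somewhere.

```java
import java.io.File;

class ListNamesDemo {
    // Describe a string as space-separated code points, e.g. "U+8349 U+26FF6".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 UTF-16 units
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Directory to inspect: first argument, or the current directory.
        File folder = new File(args.length > 0 ? args[0] : ".");
        File[] files = folder.listFiles();
        if (files == null) {
            System.err.println("Not a readable directory: " + folder);
            return;
        }
        for (File file : files) {
            System.out.println(codePoints(file.getName()) + "  exists=" + file.exists());
        }
    }
}
```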

answered Oct 15 '22 00:10

bobince

Tags: java, file, filenames, unicode, surrogate-pairs