
File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system.

Note that it is not merely the display of these file names that is causing me problems. I'm mainly interested in doing a comparison of file names with a remote file storage system, so I care more about the content of the name strings than the character encoding used to print output.

Here is a program to demonstrate. It creates a file with a Unicode name then prints out URL-encoded versions of the file names obtained from the directly-created File, and the same file when listed under a parent directory (you should run this code in an empty directory). The results show the different encoding returned by the File.listFiles() method.

    String fileName = "Trîcky Nåme";
    File file = new File(fileName);
    file.createNewFile();
    System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

    // Get parent (current) dir and list file contents
    File parentDir = file.getAbsoluteFile().getParentFile();
    File[] children = parentDir.listFiles();
    for (File child: children) {
        System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
    }

Here's what I get when I run this test code on my systems. Note the %CC versus %C3 character representations.

OS X Snow Leopard:

    File name: Tri%CC%82cky+Na%CC%8Ame
    Listed name: Tr%C3%AEcky+N%C3%A5me

    $ java -version
    java version "1.6.0_20"
    Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
    Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on same OS X system):

    File name: Tri%CC%82cky+Na%CC%8Ame
    Listed name: Tr%C3%AEcky+N%C3%A5me

    $ java -version
    java version "1.6.0_18"
    OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
    OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, nor do I want to resort to such hacks.

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names.

James Murty asked Aug 31 '10 14:08



2 Answers

Solution extracted from question:

Thanks to Stephen P for putting me on the right track.

The fix first, for the impatient. If you are compiling with Java 6 you can use the java.text.Normalizer class to normalize strings into a common form of your choice, e.g.

    // Normalize to "Normalization Form Canonical Decomposition" (NFD)
    protected String normalizeUnicode(String str) {
        Normalizer.Form form = Normalizer.Form.NFD;
        if (!Normalizer.isNormalized(str, form)) {
            return Normalizer.normalize(str, form);
        }
        return str;
    }

Since java.text.Normalizer is only available in Java 6 and later, if you need to compile with Java 5 you might have to resort to the sun.text.Normalizer implementation and something like this reflection-based hack. See also "How does this normalize function work?"

This alone is enough for me to decide I won't support compilation of my project with Java 5 :|

Here are other interesting things I learned in this sordid adventure.

  • The confusion is caused by the file names being in one of two normalization forms which cannot be directly compared: Normalization Form Canonical Decomposition (NFD) or Normalization Form Canonical Composition (NFC). The former tends to have ASCII letters followed by "modifiers" that add accents etc., while the latter has only the extended characters with no leading ASCII character. Read the wiki page Stephen P references for a better explanation.

  • Unicode string literals like the one contained in the example code (and those received via HTTP in my real app) are in the NFD form, while file names returned by the File.listFiles() method are NFC. The following mini-example demonstrates the differences:

    String name = "Trîcky Nåme";
    System.out.println("Original name: " + URLEncoder.encode(name, "UTF-8"));
    System.out.println("NFC Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFC), "UTF-8"));
    System.out.println("NFD Normalized name: " + URLEncoder.encode(
        Normalizer.normalize(name, Normalizer.Form.NFD), "UTF-8"));

    Output:

    Original name: Tri%CC%82cky+Na%CC%8Ame
    NFC Normalized name: Tr%C3%AEcky+N%C3%A5me
    NFD Normalized name: Tri%CC%82cky+Na%CC%8Ame

  • If you construct a File object with a string name, the File.getName() method will return the name in whatever form you gave it originally. However, if you call File methods that discover names on their own, they seem to return names in NFC form. This is potentially a nasty gotcha. It certainly got me.

  • According to the quote below from Apple's documentation, file names are stored in decomposed (NFD) form on the HFS Plus file system:

    When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode.

    So the File.listFiles() method helpfully (?) converts file names to the (pre)composed (NFC) form.
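One way to guard against the gotcha above is to normalize both the name you are looking for and each listed name before comparing them. The following is only a minimal sketch: the class name NormalizedLookup and the helpers toCommonForm and findChild are hypothetical names invented for illustration, and NFC is an arbitrary choice of common form (NFD would do equally well, as long as both sides use the same one).

```java
import java.io.File;
import java.io.IOException;
import java.text.Normalizer;

public class NormalizedLookup {

    // Bring a name into one common form (NFC here) so that composed and
    // decomposed spellings of the same name compare equal.
    static String toCommonForm(String name) {
        return Normalizer.normalize(name, Normalizer.Form.NFC);
    }

    // Return the child of dir whose name matches wanted after normalization,
    // or null if there is no such child (or dir cannot be listed).
    static File findChild(File dir, String wanted) {
        File[] children = dir.listFiles();
        if (children == null) {
            return null;
        }
        String target = toCommonForm(wanted);
        for (File child : children) {
            if (toCommonForm(child.getName()).equals(target)) {
                return child;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Decomposed (NFD) spelling of the tricky name from the question,
        // written with Unicode escapes so the result does not depend on the
        // encoding of the source file.
        String decomposed = "Tri\u0302cky Na\u030Ame";
        File file = new File(decomposed);
        file.createNewFile();

        File parentDir = file.getAbsoluteFile().getParentFile();
        File found = findChild(parentDir, decomposed);
        System.out.println("Found by normalized name: " + (found != null));
    }
}
```

As in the question's test program, run this in an empty scratch directory, since it creates a file in the current directory.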

2 revs, answered Sep 26 '22 04:09



Using Unicode, there is more than one valid way to represent the same letter. The characters you're using in your Tricky Name are a "latin small letter i with circumflex" and a "latin small letter a with ring above".

You say "Note the %CC versus %C3 character representations", but looking closer what you see are the sequences

    i 0xCC 0x82  vs.  0xC3 0xAE
    a 0xCC 0x8A  vs.  0xC3 0xA5

That is, the first is the letter i followed by 0xCC 0x82, which is the UTF-8 encoding of the Unicode \u0302 "combining circumflex accent" character, while the second is the UTF-8 encoding of \u00EE "latin small letter i with circumflex". Similarly for the other pair: the first is the letter a followed by 0xCC 0x8A, the "combining ring above" character (\u030A), and the second is \u00E5 "latin small letter a with ring above". Both of these are valid UTF-8 encodings of valid Unicode character strings, but one is in "composed" and the other in "decomposed" format.
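To see these byte sequences for yourself, something like the following sketch can help. The class name Utf8Bytes and the helper utf8Hex are made up for this illustration. The strings are written with Unicode escapes so the result does not depend on how the source file is encoded.

```java
import java.nio.charset.Charset;

public class Utf8Bytes {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Render the UTF-8 bytes of s as space-separated uppercase hex pairs.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(UTF8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("i\u0302")); // 69 CC 82 (decomposed)
        System.out.println(utf8Hex("\u00EE"));  // C3 AE    (composed)
        System.out.println(utf8Hex("a\u030A")); // 61 CC 8A (decomposed)
        System.out.println(utf8Hex("\u00E5"));  // C3 A5    (composed)
    }
}
```

The leading 69 and 61 are the plain ASCII letters i and a, which URLEncoder leaves unescaped in the question's output.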

OS X HFS Plus volumes store strings (e.g. filenames) as "fully decomposed". On a Unix file system, how a name is stored depends on how the filesystem driver chooses to store it, so you can't make any blanket statements across different types of filesystems.

See the Wikipedia article on Unicode Equivalence for general discussion of composed vs decomposed forms, which mentions OS X specifically.

See Apple's Tech Q&A QA1235 (in Objective-C unfortunately) for information on converting forms.

A recent email thread on Apple's java-dev mailing list could be of some help to you.

Basically, you need to normalize the decomposed form into a composed form before you can compare the strings.
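In Java 6 and later that could look roughly like the sketch below, which uses java.text.Normalizer. The class name NormalizedCompare and the helper sameName are hypothetical, and NFC is simply one reasonable choice of common form.

```java
import java.text.Normalizer;

public class NormalizedCompare {

    // True if a and b are equal once both are brought to the same
    // normalization form (NFC is used here).
    static boolean sameName(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "Tr\u00EEcky N\u00E5me";     // precomposed letters
        String decomposed = "Tri\u0302cky Na\u030Ame"; // base letters + combining marks

        System.out.println(composed.equals(decomposed));    // false
        System.out.println(sameName(composed, decomposed)); // true
    }
}
```

Normalizing both arguments, rather than only the one you believe to be decomposed, means the comparison does not depend on knowing which side came from the filesystem.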

Stephen P answered Sep 22 '22 04:09
