The following Java program, running on Linux with OpenJDK 1.6.0_22, simply lists the contents of the directory passed as a command-line argument. The directory contains files whose names are encoded in UTF-8 (e.g. Hindi, Mandarin, German, etc.).
import java.io.*;

class ListDir {
    public static void main(String[] args) throws Exception {
        //System.setProperty("file.encoding", "en_US.UTF-8");
        System.out.println(System.getProperty("file.encoding"));
        File f = new File(args[0]);
        for (String c : f.list()) {
            // Rebuild the path from the listed name and check that it resolves.
            File cf = new File(args[0] + "/" + c);
            System.out.println(cf.getAbsolutePath() + " --> " + cf.exists());
        }
    }
}
If I set the LC_ALL variable to en_US.UTF-8, the results are printed fine. But if I set LC_ALL to POSIX and supply the file.encoding and sun.jnu.encoding properties as UTF-8 on the command line, I get garbage output and cf.exists() returns false.
Can you please explain this behavior? Many websites say that file.encoding is sufficient to read file names and use them in file operations. Here it looks like that property has no effect at all.
Update 1: If I set file.encoding to something like GBK (Chinese) and LC_ALL to en_US.UTF-8, then cf.exists() returns true. Only a '?' is printed in place of the file name. Surprise o_O.
Update 2: After more investigation, it looks like it's not a Java issue. It looks like libc on Linux uses the locale settings to translate file name encodings, and those settings cause the file-not-found error/exception. "file.encoding" only controls how Java interprets file names.
Update 3: Now it looks like the problem is how Java interprets file names. The following simple C code works on Linux regardless of the file name encoding and the value of the LC_ALL environment variable (I am glad that this confirms the answer given here: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding). But I am still not clear on how Java uses the LC_ALL variable. I am now looking into the OpenJDK code for that.
Sample C code:
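#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }
    char *argdir = argv[1];
    DIR *dp = opendir(argdir);
    if (dp == NULL) {
        perror("opendir");
        return 1;
    }
    struct dirent *de;
    while ((de = readdir(dp)) != NULL) {
        /* Build "<dir>/<name>" from the raw bytes; no encoding conversion is involved. */
        char *abspath = (char *) malloc(strlen(argdir) + 1 + strlen(de->d_name) + 1);
        strcpy(abspath, argdir);
        abspath[strlen(argdir)] = '/';
        strcpy(abspath + strlen(argdir) + 1, de->d_name);
        printf("%d %s ", de->d_type, abspath);
        FILE *fp = fopen(abspath, "r");
        if (fp) {
            printf("Success");
            fclose(fp); /* only close a stream that was actually opened */
        }
        putchar('\n');
        free(abspath);
    }
    closedir(dp);
    return 0;
}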
The LC_ALL variable overrides all the locale category variables printed by the 'locale' command. It is a convenient way of specifying a language environment with one variable, without having to specify each LC_* variable. Processes launched in that environment will run in the specified locale.
LC_ALL is the strongest locale environment variable, except for LANGUAGE (which GNU gettext consults first, and only for message translations). It overrides every other variable and is the first to be checked when a locale setting is needed. Thus, it should be used with caution, and only when there is no other solution to the problem at hand.
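The precedence described above can be sketched in Java as follows. This is only an illustration of the POSIX lookup order for the character-type category (LC_ALL, then LC_CTYPE, then LANG), not what libc or the JDK literally execute; the class name LocalePrecedence and the method effectiveCtype are made up for this example.

```java
import java.util.Map;

class LocalePrecedence {
    // Hypothetical helper: returns the locale name that would apply to LC_CTYPE,
    // following the POSIX order LC_ALL > LC_CTYPE > LANG.
    // Unset or empty variables are skipped; "POSIX" is the fallback.
    static String effectiveCtype(Map<String, String> env) {
        String[] order = {"LC_ALL", "LC_CTYPE", "LANG"};
        for (String name : order) {
            String value = env.get(name);
            if (value != null && value.length() > 0) {
                return value;
            }
        }
        return "POSIX";
    }

    public static void main(String[] args) {
        System.out.println("effective LC_CTYPE: " + effectiveCtype(System.getenv()));
    }
}
```

Running it once with LC_ALL=POSIX and once with LC_ALL unset shows which variable ends up deciding the character set.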
Note: I think I have finally nailed it down. I cannot confirm that this is right, but it is what I found from reading the code and running some tests, and I don't have more time to look into it. If anyone is interested, they can check it and tell me whether this answer is right or wrong - I would be glad :)
The reference I used is this tarball, available from OpenJDK's site: openjdk-6-src-b25-01_may_2012.tar.gz
Java natively translates all strings to the platform's local encoding in this function: jdk/src/share/native/common/jni_util.c - JNU_GetStringPlatformChars(). The system property sun.jnu.encoding is used to determine the platform's encoding.
The value of sun.jnu.encoding is set in jdk/src/solaris/native/java/lang/java_props_md.c - GetJavaProperties() using libc's setlocale() function. The environment variable LC_ALL is used to set the value of sun.jnu.encoding. A value given on the command line with the -Dsun.jnu.encoding option is ignored.
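A quick way to check this on a given machine is to print the relevant values side by side. The sketch below is only a diagnostic aid; the class name EncodingProbe is made up, and sun.jnu.encoding is an internal, undocumented property that may be null on some JVMs.

```java
import java.nio.charset.Charset;

class EncodingProbe {
    // Collects the environment variable and the properties discussed above.
    static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("LC_ALL           = ").append(System.getenv("LC_ALL")).append('\n');
        sb.append("sun.jnu.encoding = ").append(System.getProperty("sun.jnu.encoding")).append('\n');
        sb.append("file.encoding    = ").append(System.getProperty("file.encoding")).append('\n');
        sb.append("default charset  = ").append(Charset.defaultCharset().name()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Comparing a run under LC_ALL=POSIX with a run under LC_ALL=en_US.UTF-8, with and without -Dsun.jnu.encoding=UTF-8, should show whether the command-line value is honored on your JDK.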
The call to File.exists() is implemented in jdk/src/share/classes/java/io/File.java, and it returns:
return ((fs.getBooleanAttributes(this) & FileSystem.BA_EXISTS) != 0);
getBooleanAttributes() is implemented natively (I am skipping several steps of browsing through many files) in jdk/src/share/native/java/io/UnixFileSystem_md.c, in the function Java_java_io_UnixFileSystem_getBooleanAttributes0(). There, the macro
WITH_FIELD_PLATFORM_STRING(env, file, ids.path, path)
converts the path string to the platform's encoding.
So a conversion to the wrong encoding sends the wrong C string (char array) to the subsequent stat() call, which then reports that the file cannot be found.
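The effect can be modelled in pure Java, without touching the file system. The sketch below is not the JDK's native code; it only imitates the decode (as in File.list()) and re-encode (as in stat()) steps using a chosen charset, on the assumption that the POSIX locale maps to an ASCII charset. The class name LossyRoundTrip and the sample name "\u00fc.txt" are made up for illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class LossyRoundTrip {
    // Decode the on-disk bytes into a String, then encode that String again,
    // both with the same "platform" charset.
    static byte[] roundTrip(byte[] onDisk, Charset platform) {
        String decoded = new String(onDisk, platform);
        return decoded.getBytes(platform);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset ascii = Charset.forName("US-ASCII");

        // Bytes of a UTF-8 encoded file name as they would be stored on disk.
        byte[] onDisk = "\u00fc.txt".getBytes(utf8);

        System.out.println("UTF-8 round trip intact:    "
                + Arrays.equals(onDisk, roundTrip(onDisk, utf8)));
        System.out.println("US-ASCII round trip intact: "
                + Arrays.equals(onDisk, roundTrip(onDisk, ascii)));
    }
}
```

With the ASCII charset, the non-ASCII bytes are replaced during decoding and come back as '?' when encoded, so the byte string handed to stat() no longer matches the name on disk.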
LESSON: LC_ALL is very important.
I'm not sure where you read about file.encoding. I don't see it mentioned with the other standard properties as documented with System.getProperties. But judging from my experiments, it seems that this value influences the encoding of file content, not file names. System.out in particular will not print non-ASCII characters if file.encoding is POSIX.
On the other hand, the Linux way to decide which encoding applies to file names is the LC_CTYPE facet of the current locale setting. I see no reason why Java should override this. As many other platforms (Windows in particular) always use Unicode for file names, not bytes, there is little point in exposing the byte-level details of the file system to a Java application.