Java Charset problem on Linux

Problem: I have a string containing special characters that I convert to bytes and back again. The conversion works properly on Windows, but on Linux the special character is not converted properly. The default charset on Linux is UTF-8, as reported by Charset.defaultCharset().displayName().

However, if I run on Linux with the option -Dfile.encoding=ISO-8859-1, it works properly.

How can I make it work with the UTF-8 default charset, without setting the -D option in a Unix environment?

Edit: I use JDK 1.6.13.

Edit: the code snippet below works with cs = "ISO-8859-1"; or cs = "UTF-8"; on Windows, but not on Linux.

        String x = "½";
        System.out.println(x);
        byte[] ba = x.getBytes(Charset.forName(cs));
        for (byte b : ba) {
            System.out.println(b);
        }
        String y = new String(ba, Charset.forName(cs));
        System.out.println(y);

~regards daed

Inv3r53 asked Jan 30 '10 15:01


3 Answers

Your characters are probably being corrupted by the compilation process and you're ending up with junk data in your class file.

"if i run on linux with option -Dfile.encoding=ISO-8859-1 it works properly.."

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

In short, don't use -Dfile.encoding=...

    String x = "½";

Since U+00BD (½) is represented by different byte values in different encodings:

windows-1252     BD
UTF-8            C2 BD
ISO-8859-1       BD

...you need to tell your compiler which encoding your source file uses:

    javac -encoding ISO-8859-1 Foo.java
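The byte values in the table above can be checked with a small sketch like the following. The class name CharsetBytes and the helper hex are made up for illustration; the string is written as the escape \u00BD so that the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.Charset;

public class CharsetBytes {
    // Hypothetical helper: encodes s with the named charset and returns
    // the resulting bytes as upper-case hex, separated by spaces.
    static String hex(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escape instead of a literal, so javac's -encoding setting is irrelevant here.
        String half = "\u00BD";
        for (String cs : new String[] {"windows-1252", "UTF-8", "ISO-8859-1"}) {
            System.out.println(cs + " -> " + hex(half, cs));
        }
    }
}
```

Using the escape is also a possible workaround when you cannot control the compiler's -encoding option.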

Now we get to this one:

    System.out.println(x);

As a PrintStream, System.out encodes the data to the system encoding before emitting the bytes, roughly like this:

    System.out.write(x.getBytes(Charset.defaultCharset()));

That may or may not work as you expect on some platforms - the byte encoding must match the encoding the console is expecting for the characters to show up correctly.
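If you know which encoding the console expects, one option is to wrap System.out in a PrintStream with an explicit charset rather than relying on the default. This is only a sketch: the class and method names are invented for illustration, and "UTF-8" is an assumption about the terminal, not something the program can verify.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitConsoleEncoding {
    // Hypothetical helper: a PrintStream that encodes text with the named
    // charset instead of the platform default.
    static PrintStream printStreamFor(OutputStream out, String charsetName) {
        try {
            return new PrintStream(out, true, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    // Hypothetical helper for inspection: returns the raw bytes such a
    // PrintStream would emit for s.
    static byte[] encodeVia(String s, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream ps = printStreamFor(buffer, charsetName);
        ps.print(s);
        ps.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Assumption: the terminal is configured for UTF-8.
        PrintStream out = printStreamFor(System.out, "UTF-8");
        out.println("\u00BD");
    }
}
```

Whether the character then displays correctly still depends on the terminal's own settings and font.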

McDowell answered Nov 10 '22 06:11


Your problem description is a bit vague. You mentioned that -Dfile.encoding solved your Linux problem, but this is in fact only used to inform the Sun(!) JVM which encoding to use to manage filenames/pathnames on the local disk file system. And this doesn't fit the problem description you literally gave: "converting chars to bytes and back to chars failed". I don't see what -Dfile.encoding has to do with this. There must be more to the story. How did you conclude that it failed? Did you read/write those characters from/to a pathname or filename? Or were you maybe printing to stdout? Did stdout itself use the proper encoding?

That said, why would you want to convert the chars back and forth to/from bytes? I don't see any useful business purpose for this.

(Sorry, this didn't fit in a comment, but I will update this with an answer once you have given more info about the actual functional requirement.)

Update: as per the comments, you basically just need to configure stdout/cmd so that it uses the proper encoding to display those characters. In Windows you can do that with the chcp command, but there's one major caveat: the standard fonts used in Windows cmd do not have the proper glyphs (the actual font pictures) for characters outside the ISO-8859 charsets. You can hack the registry to add proper fonts. I can't say much about Linux as I don't use it extensively, but it looks like -Dfile.encoding is somehow the way to go. After all, I think it's better to replace cmd with a cross-platform UI tool, for example Swing, to display the characters the way you want.

BalusC answered Nov 10 '22 04:11


You should make the conversion explicit:

    // Both calls declare the checked java.io.UnsupportedEncodingException.
    byte[] byteArray = "abcd".getBytes("ISO-8859-1");
    String decoded = new String(byteArray, "ISO-8859-1");

EDIT:

It seems that the problem is the encoding of your Java source file. If it works on Windows, try compiling the source files on Linux with javac -encoding ISO-8859-1. This should solve your problem.
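To see why a mismatched source encoding corrupts the literal, here is a rough simulation. It is not what javac does internally; it only decodes the bytes a file would contain under one charset as if they were in another. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class SourceEncodingMismatch {
    // Hypothetical helper: encodes text as it would be stored on disk (savedAs),
    // then decodes those bytes with a different charset (readAs).
    static String reinterpret(String text, String savedAs, String readAs) {
        byte[] onDisk = text.getBytes(Charset.forName(savedAs));
        return new String(onDisk, Charset.forName(readAs));
    }

    // Formats each char of s as a U+XXXX code unit, for unambiguous printing.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String half = "\u00BD";
        // Saved as ISO-8859-1 (single byte BD) but read as UTF-8:
        // a lone BD is malformed UTF-8 and becomes the replacement character.
        System.out.println(codeUnits(reinterpret(half, "ISO-8859-1", "UTF-8")));
        // Saved as UTF-8 (bytes C2 BD) but read as ISO-8859-1:
        // each byte becomes its own character.
        System.out.println(codeUnits(reinterpret(half, "UTF-8", "ISO-8859-1")));
    }
}
```

The first case matches the scenario in the question: a file written on Windows in a single-byte encoding and compiled on Linux, where the default is UTF-8.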

tangens answered Nov 10 '22 06:11