
Java Unicode Confusion

Tags:

java

unicode

Hey all, I have only just started learning Java and have run into something really confusing!

I was typing out an example from the book I am using; it demonstrates the char data type.

The code is as follows:

public class CharDemo
{
    public static void main(String[] args)
    {
        char a = 'A';
        char b = (char) (a + 1);
        System.out.println(a + b);
        System.out.println("a + b is " + a + b);
        int x = 75;
        char y = (char) x;
        char half = '\u00AB';
        System.out.println("y is " + y + " and half is " + half);
    }
}

The bit that is confusing me is the statement char half = '\u00AB'. The book states that \u00AB is the code for the symbol '1/2'. As described, when I compile and run the program from cmd, the symbol produced on this line is in fact a '1/2'.

So everything appears to be working as it should. I decided to play around with the code and try some different Unicode codes. I googled several Unicode tables and found none of them to be consistent with the above result.

In every one I found, it stated that the code \u00AB was not for '1/2' and was in fact for this:

http://www.fileformat.info/info/unic...r/ab/index.htm

So what character set is Java using? I thought Unicode was supposed to be just that: Uni, only one. I have searched for hours and nowhere can I find a character set that states \u00AB is equal to a 1/2, yet this is what my Java compiler interprets it as.

I must be missing something obvious here! Thanks for any help!

Nick asked Jan 20 '11 12:01


1 Answer

This is a well-known console encoding mismatch on Windows platforms.

The Java runtime expects the encoding used by the system console to be the same as the system default encoding. However, Windows uses two separate encodings: the ANSI code page (the system default encoding) and the OEM code page (the console encoding).
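To see which encodings your own JVM reports, something like the following rough sketch may help. The class name EncodingReport is made up for illustration, and the property names are assumptions about your JDK: native.encoding and stdout.encoding only exist on newer releases, and sun.stdout.encoding is an internal property of older ones, so any of them may show up as "(not set)".

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original example.
class EncodingReport {
    // Collects the JVM default charset plus a few encoding-related system properties.
    // Properties the running JDK does not define are reported as "(not set)".
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        String[] keys = {"file.encoding", "native.encoding", "stdout.encoding", "sun.stdout.encoding"};
        for (String key : keys) {
            report.put(key, System.getProperty(key, "(not set)"));
        }
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running chcp in the same cmd window shows the active console (OEM) code page, which you can compare against what the JVM reports.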

So when you write the Unicode character U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK to the console, the Java runtime assumes the console uses the ANSI encoding (Windows-1252 in your case), where this character is represented by the byte 0xAB. However, the console actually uses the OEM encoding (CP437 in your case), where 0xAB means ½. Java itself is using Unicode correctly: U+00AB is «, and the Unicode code point for ½ is U+00BD.
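The mismatch can be reproduced without a Windows console by encoding with one charset and decoding with the other. This is only a sketch: the class and method names are invented for illustration, and it assumes your JDK ships the IBM437 charset (it is an extended charset, so a stripped-down runtime may not have it).

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original example.
class CodePageMismatch {
    // Encodes text with one charset and decodes the resulting bytes with another,
    // mimicking a program that writes in encoding A to a console that reads encoding B.
    static String reinterpret(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        // Written as Windows-1252 (ANSI), read back as CP437 (OEM).
        String shown = reinterpret("\u00AB", "windows-1252", "IBM437");
        char c = shown.charAt(0);
        System.out.printf("U+00AB is shown as U+%04X (%s)%n", (int) c, Character.getName(c));
    }
}
```

If both charsets are available, this should report U+00BD, VULGAR FRACTION ONE HALF, which matches what you saw in cmd. Plain ASCII characters such as 'A' come through unchanged because both code pages agree on them.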

Therefore, printing to the Windows console with System.out.println() produces wrong results for characters whose byte values differ between the two code pages.

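To get correct results, you can use System.console().writer().println() instead. Note that System.console() returns null when no console is attached, for example when output is redirected.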
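A minimal sketch of that approach, with a fallback for the case where no console is available. The class name ConsolePrint and the helper consoleOrStdout are hypothetical; only System.console() and Console.writer() come from the JDK.

```java
import java.io.Console;
import java.io.PrintWriter;

// Hypothetical demo class, not part of the original example.
class ConsolePrint {
    // Returns the console's writer when a console is attached; otherwise falls back
    // to an auto-flushing PrintWriter around System.out (e.g. when output is redirected).
    static PrintWriter consoleOrStdout() {
        Console console = System.console();
        if (console != null) {
            return console.writer();
        }
        return new PrintWriter(System.out, true);
    }

    public static void main(String[] args) {
        char quote = '\u00AB'; // U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
        PrintWriter out = consoleOrStdout();
        out.println("quote is " + quote);
        out.flush();
    }
}
```

In the fallback case the output goes through System.out again, so the original mismatch can still appear there; the fallback only avoids a NullPointerException.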

axtavt answered Oct 14 '22 05:10