
Java. Why does it work differently with English and Slavic characters?

I've run into something strange while working with Java. Maybe it's ordinary, but I don't understand why it works this way.

I have code like this:

Character x = 'B';
Object o = x;
System.out.println(o == 'B');

It works fine and the output is "true". Then I change the English B to the Slavic B (Б):

Character x = 'Б';
Object o = x;
System.out.println(o == 'Б');

Now the output is "false". How come? By the way, the output is still "true" if I compare the x variable with 'Б' directly, but when I go through an Object it behaves differently.

Can anyone please explain this behaviour?

user2452103 asked Sep 10 '14 16:09


2 Answers

Without boxing - using just char - you'd be fine. Likewise if you use equals instead of ==, you'd be fine. The problem is that you're comparing references for boxed values using ==, which just checks for reference identity. You're seeing a difference because of the way auto-boxing works. You can see the same thing with Integer:

Object x = 0;
Object y = 0;
System.out.println(x == y); // Guaranteed to be true

x = 10000; // reuse the variables declared above
y = 10000;
System.out.println(x == y); // *May* be true

Basically "small" values have cached boxed representations, whereas "larger" values may not.

From JLS 5.1.7:

If the value p being boxed is an integer literal of type int between -128 and 127 inclusive (§3.10.1), or the boolean literal true or false (§3.10.3), or a character literal between '\u0000' and '\u007f' inclusive (§3.10.4), then let a and b be the results of any two boxing conversions of p. It is always the case that a == b.

Ideally, boxing a primitive value would always yield an identical reference. In practice, this may not be feasible using existing implementation techniques. The rule above is a pragmatic compromise, requiring that certain common values always be boxed into indistinguishable objects. The implementation may cache these, lazily or eagerly. For other values, the rule disallows any assumptions about the identity of the boxed values on the programmer's part. This allows (but does not require) sharing of some or all of these references. Notice that integer literals of type long are allowed, but not required, to be shared.

This ensures that in most common cases, the behavior will be the desired one, without imposing an undue performance penalty, especially on small devices. Less memory-limited implementations might, for example, cache all char and short values, as well as int and long values in the range of -32K to +32K.

The part about "a character literal between '\u0000' and '\u007f'" guarantees that boxed ASCII characters will be cached, but not non-ASCII boxed characters.
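To make the two fixes mentioned at the top of this answer concrete, here is a minimal sketch of comparing through an Object by value rather than by reference. The class and method names are made up for illustration, and the Cyrillic letter from the question is written as the escape '\u0411' so the source stays ASCII-only.

```java
public class BoxedCharCompare {

    // Hypothetical helper: compares by value, so the Character cache is irrelevant.
    static boolean sameByEquals(Object boxed, char expected) {
        // 'expected' is autoboxed to Character; equals compares the wrapped char values.
        return boxed.equals(expected);
    }

    // Hypothetical helper: unboxes first, so == compares two primitive chars.
    static boolean sameByUnboxing(Object boxed, char expected) {
        return boxed instanceof Character
                && ((Character) boxed).charValue() == expected;
    }

    public static void main(String[] args) {
        Object latin = 'B';          // inside the guaranteed cache range
        Object cyrillic = '\u0411';  // Cyrillic capital BE, outside that range

        System.out.println(sameByEquals(latin, 'B'));            // true
        System.out.println(sameByEquals(cyrillic, '\u0411'));    // true
        System.out.println(sameByUnboxing(cyrillic, '\u0411'));  // true

        // Reference comparison: guaranteed true only inside the cached range.
        System.out.println(latin == (Object) 'B');               // true

        // Implementation-dependent, so no expected output is claimed here.
        System.out.println(cyrillic == (Object) '\u0411');
    }
}
```

Either helper gives the same answer for both letters, which is why comparing the Character variable x directly against a char literal works in the question: one operand is a primitive, so the other is unboxed and the values are compared.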

Jon Skeet answered Nov 12 '22 03:11


When you write

Character x = 'B';

the compiler boxes the literal by calling Character.valueOf(char), as the bytecode shows:

2: invokestatic  #16                 // Method java/lang/Character.valueOf:(C)Ljava/lang/Character;

which caches small values. From its Javadoc:

This method will always cache values in the range '\u0000' to '\u007F', inclusive, and may cache other values outside of this range.

public static Character valueOf(char c) {
    if (c <= 127) { // must cache
        return CharacterCache.cache[(int)c];
    }
    return new Character(c);
}
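The guarantee can be checked directly against Character.valueOf. This is only a sketch: the class and method names are illustrative rather than anything from the JDK, and it asserts nothing beyond what the Javadoc quoted above promises.

```java
public class ValueOfCacheCheck {

    // Hypothetical helper: true when two separate valueOf calls return the same object.
    static boolean sharesInstance(char c) {
        return Character.valueOf(c) == Character.valueOf(c);
    }

    public static void main(String[] args) {
        // Inside the documented range (0x0000 to 0x007F) the same instance must come back.
        System.out.println(sharesInstance('B'));          // true
        System.out.println(sharesInstance((char) 0x7F));  // true

        // Outside the range the Javadoc only says values "may" be cached,
        // so the printed result depends on the JDK in use.
        System.out.println(sharesInstance((char) 0x0411));
    }
}
```

The last line is the case from the question: a Cyrillic letter sits above 0x007F, so two boxings of it are allowed to be distinct objects.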

Similar question:

  • Integer wrapper class and == operator - where is behavior specified?
jmj answered Nov 12 '22 01:11