Hi all, I was browsing through some of the Java source code when I came across this in java.lang.Character:
public static boolean isHighSurrogate(char ch) {
    return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1);
}

public static boolean isLowSurrogate(char ch) {
    return ch >= MIN_LOW_SURROGATE && ch < (MAX_LOW_SURROGATE + 1);
}
I was wondering why the writer added 1 to the upper limit and used a less-than comparison, instead of simply using a less-than-or-equal comparison. I could understand it if it helped readability, but that doesn't seem to be the case here.
What's the difference between the code above and this:
public static boolean isHighSurrogate(char ch) {
    return ch >= MIN_HIGH_SURROGATE && ch <= MAX_HIGH_SURROGATE;
}

public static boolean isLowSurrogate(char ch) {
    return ch >= MIN_LOW_SURROGATE && ch <= MAX_LOW_SURROGATE;
}
Perhaps the author is trying to be consistent with Dijkstra's advice to make all ranges half-open -- the start point is inclusive and the endpoint is exclusive.
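If so, a minimal sketch of the two conventions side by side (hypothetical helpers of my own, not JDK code) might look like this:

// Hypothetical helpers illustrating the two range conventions; not from the JDK.
public final class RangeStyles {

    // Half-open range: start inclusive, end exclusive (Dijkstra's preference).
    static boolean inHalfOpenRange(char ch, char startInclusive, int endExclusive) {
        return ch >= startInclusive && ch < endExclusive;
    }

    // Closed range: both endpoints inclusive.
    static boolean inClosedRange(char ch, char min, char max) {
        return ch >= min && ch <= max;
    }

    public static void main(String[] args) {
        char ch = Character.MIN_HIGH_SURROGATE;
        // Both forms accept exactly the same set of code units.
        System.out.println(inHalfOpenRange(ch, Character.MIN_HIGH_SURROGATE,
                Character.MAX_HIGH_SURROGATE + 1)); // true
        System.out.println(inClosedRange(ch, Character.MIN_HIGH_SURROGATE,
                Character.MAX_HIGH_SURROGATE));     // true
    }
}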
There is no semantic difference here, but there is a subtle difference in bytecode: (MAX_HIGH_SURROGATE + 1) is an int, so the first code snippet does a char-to-char comparison followed by an int-to-int comparison, while the second does two char-to-char comparisons. This does not lead to a semantic difference -- the implicit casts are to wider types, so there is no risk of overflow in either code snippet.
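To see the promotion concretely, here is a small standalone illustration (my own sketch, not JDK code):

public class PromotionDemo {
    public static void main(String[] args) {
        // MAX_HIGH_SURROGATE is a char (0xDBFF); adding the int literal 1
        // promotes the result to int, so the comparison runs in int arithmetic
        // and cannot wrap around.
        int upperExclusive = Character.MAX_HIGH_SURROGATE + 1;
        char ch = Character.MAX_HIGH_SURROGATE;
        System.out.println(Integer.toHexString(upperExclusive)); // prints dc00
        System.out.println(ch < upperExclusive);                 // true
        System.out.println(ch <= Character.MAX_HIGH_SURROGATE);  // true, same result
    }
}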
Optimizing out the addition and converting the int-to-int comparison back into a 2-byte unsigned int comparison is well within the scope of the kinds of optimizations done by the JIT, so I don't see any particular performance reason to prefer one over the other.
I tend to write this kind of thing as

MIN_LOW_SURROGATE <= ch && ch <= MAX_LOW_SURROGATE

That way, the ch in the middle makes it obvious to a reader that ch is being tested against the range formed by the outer values.
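For example, isLowSurrogate written in that style (just a sketch of the preference, not the actual JDK source) would read:

public static boolean isLowSurrogate(char ch) {
    // The value under test sits between the two bounds, mirroring the math notation.
    return Character.MIN_LOW_SURROGATE <= ch && ch <= Character.MAX_LOW_SURROGATE;
}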
Wild guess:
A surrogate character is any of a range of Unicode code points that are used in pairs in UTF-16 to represent characters beyond the Basic Multilingual Plane.
My guess is that the author wanted to guard against 8-bit behaviour: if the maximum were 0xFF, then 0xFF + 1 would overflow and wrap back to 0x00, making the comparison always false. So if the code were compiled with 8-bit chars, it would always return false (outside the UTF-16 range), while with chars wider than 8 bits, 0xFF + 1 would be 0x100 and the check would still work.
Hope this makes some sense to you.
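To illustrate the wrap-around scenario described above (purely hypothetical, since a Java char is always 16 bits and the addition is done in int arithmetic), masking the sum to 8 bits simulates the effect:

public class WrapAroundDemo {
    public static void main(String[] args) {
        // Hypothetical 8-bit arithmetic: 0xFF + 1 truncated to 8 bits wraps to 0.
        int max = 0xFF;
        int wrapped = (max + 1) & 0xFF;
        System.out.println(wrapped);        // 0
        System.out.println(0x42 < wrapped); // false -- the range check always fails
        // In real Java, MAX_HIGH_SURROGATE + 1 is evaluated as an int (0xDC00),
        // so no such wrap-around occurs.
    }
}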