Why is the parameter to the String.indexOf method an int in Java?

I am wondering why the parameter to the indexOf method is an int, when the description says a char.

public int indexOf(int ch)

Returns the index within this string of the first occurrence of the specified **character**

http://download.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#indexOf%28int%29

Also, both of these compile fine:
char c = 'p';
str.indexOf(2147483647);
str.indexOf(c);

a) Basically, what I am confused about is that an int in Java is 32 bits, while Unicode characters are 16 bits.

b) Why not use the characters themselves rather than an int? Is this some performance optimization? Are chars more difficult to represent than ints? How?

I assume there is simple reasoning behind this, and that makes me want to know about it even more!

Thanks!

asked Jun 02 '11 by codeObserver


2 Answers

The real reason is that indexOf(int) expects a Unicode code point, not a 16-bit UTF-16 "character". Unicode code points can be up to 21 bits in length.

(The UTF-16 representation of a longer code point is actually two 16-bit "character" values. These values are known as leading and trailing surrogates: 0xD800 to 0xDBFF and 0xDC00 to 0xDFFF respectively; see Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM for the gory details.)

If you give indexOf(int) a code point greater than 65535, it will search for the pair of UTF-16 characters that encodes that code point.

This is stated by the javadoc (albeit not very clearly), and an examination of the code indicates that this is indeed how the method is implemented.
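
A concrete illustration, as a minimal sketch (the string and the choice of code point U+1D11E are made up for the example; assumes a Java 5+ runtime):

public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is a supplementary code point;
        // in UTF-16 it is encoded as the surrogate pair 0xD834 0xDD1E.
        String s = "clef: \uD834\uDD1E end";

        // indexOf(int) accepts the full 21-bit code point directly ...
        System.out.println(s.indexOf(0x1D11E));           // 6

        // ... and reports the position of its leading surrogate.
        System.out.println(s.indexOf('\uD834'));          // 6

        // One code point, but two char positions in the String.
        System.out.println(Character.charCount(0x1D11E)); // 2
    }
}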


Why not just use 16-bit characters?

That's pretty obvious. If they did that, there wouldn't be an easy way to locate code points greater than 65535 in Strings. That would be a major problem for people who develop internationalized applications where text may contain such code points. (A lot of supposedly internationalized applications make the incorrect assumption that a char represents a code point. Often it doesn't matter, but increasingly often it does.)

But it shouldn't make any difference to you. The method will still work if your Strings consist of only 16-bit codes ... or, for that matter, of only ASCII codes.
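
A short sketch of the pitfall mentioned above (a fragment with a made-up string; assume it runs inside a main method):

String s = "a\uD834\uDD1Eb";   // "a", U+1D11E (G clef), "b"

// length() counts 16-bit char values (UTF-16 code units) ...
System.out.println(s.length());                       // 4

// ... while codePointCount() counts actual Unicode characters.
System.out.println(s.codePointCount(0, s.length()));  // 3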

answered Sep 25 '22 by Stephen C


Characters in Java are stored in their Unicode integer representation. The Character class documentation has more details about this format.

From the docs on that page:

The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
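
That example is easy to verify; a small sketch (a fragment using only the standard library):

// The int overload supports supplementary characters directly.
System.out.println(Character.isLetter(0x2F81A));        // true

// A char cannot hold 0x2F81A; the cast silently truncates it to
// 0xF81A, so the char overload tests a different character entirely.
System.out.println(Character.isLetter((char) 0x2F81A)); // false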

answered Sep 24 '22 by Kai