Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Impossible for me to understand a method of string search as described. What is uFFFF?

I am reading something about searching for a (range of) string(s) in a sorted array of strings.

It says:

If you want to find all strings starting with "h", you can run a binary search for the strings "h" and "h\uFFFF". This gives all the indexes of the band for all the keys that start with "h". Note that a binary search can return the index where the string would be even if it is not actually in the array.

I don't understand anything from this paragraph.

What is h\uFFFF how does it help/is used in the binary search and does the last sentense also mean that even this search is faulty?

Any help to understand what is being said here please?

like image 906
Cratylus Avatar asked Jan 27 '12 15:01

Cratylus


3 Answers

\uFFFF is the "character" that sorts last in the 16-bit "alphabet", i.e. after any valid letter, character, or special symbol.

When you do binary search for a string in a sorted array, you find a place where that string could be inserted. When you have multiple identical strings, you get a location ahead of the first one. When you append "the last letter of the alphabet" after your string, the insertion point will be after the last of the identical strings, hence giving you a range of identical strings in a sorted array.

Imagine this: suppose you are not allowed to use letter Z in your words. Now you have a sorted array of strings:

0   1   2   3   4   5   6
aab abb abc abc abd bcx bdy

If you search for abc, binary search tells you the first place where you can insert it, which is 2. If you search for abcZ, thoug, binary search would return 4, because abcZ comes alphabetically right after abc. This lets you know that the range between 2, inclusive, and 4, exclusive, is occupied by the string abc. If both searches return the same number, you know that the string is not present in the array.

In the paragraph that you quoted, \uFFFF plays the role of the "prohibited letter Z" from my example.

like image 172
Sergey Kalinichenko Avatar answered Sep 17 '22 22:09

Sergey Kalinichenko


\uFFFF is the largest possible character in Java. Since the strings are sorted, searching for h will find the start of the range while h\uFFFF will find the end (assuming unicode strings here) since no second character can be larger than \uFFFF. Even if it can't match the string exactly, the search will return the index of where the target would be even if it's not really there.

update: \uFFFF is the largest possible sortable unicode character in 16 bit block, if you are working with 32 bit blocks use U+10FFFF (whatever that is in Java). I've personally never worked 32 bit unicode blocks in Java. See the section 16.7 of the 5.2.0 spec.

U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF . U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF . This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on

like image 20
Andrew White Avatar answered Sep 16 '22 22:09

Andrew White


The \uFFFF sequence in Java denotes the character with Unicode codepoint U+FFFF. However, the codepoint does not encode a character at all:

U+FFFF is used to represent a numeric value that is guaranteed not to be a character, for uses such as the final value at the end of an index.

See these references: Unicode Technical Report #16, this Unicode character chart and this character definition.

like image 29
Adam Zalcman Avatar answered Sep 16 '22 22:09

Adam Zalcman