I am reading something about searching for a (range of) string(s) in a sorted array of strings.
It says:
If you want to find all strings starting with "h", you can run a binary search for the strings "h" and "h\uFFFF". This gives all the indexes of the band for all the keys that start with "h". Note that a binary search can return the index where the string would be even if it is not actually in the array.
I don't understand anything from this paragraph.
What is h\uFFFF
how does it help/is used in the binary search and does the last sentense also mean that even this search is faulty?
Any help to understand what is being said here please?
\uFFFF is the "character" that sorts last in the 16-bit "alphabet", i.e. after any valid letter, character, or special symbol.
When you do binary search for a string in a sorted array, you find a place where that string could be inserted. When you have multiple identical strings, you get a location ahead of the first one. When you append "the last letter of the alphabet" after your string, the insertion point will be after the last of the identical strings, hence giving you a range of identical strings in a sorted array.
Imagine this: suppose you are not allowed to use letter Z
in your words. Now you have a sorted array of strings:
0 1 2 3 4 5 6
aab abb abc abc abd bcx bdy
If you search for abc
, binary search tells you the first place where you can insert it, which is 2. If you search for abcZ
, thoug, binary search would return 4, because abcZ
comes alphabetically right after abc
. This lets you know that the range between 2, inclusive, and 4, exclusive, is occupied by the string abc
. If both searches return the same number, you know that the string is not present in the array.
In the paragraph that you quoted, \uFFFF
plays the role of the "prohibited letter Z" from my example.
\uFFFF
is the largest possible character in Java. Since the strings are sorted, searching for h
will find the start of the range while h\uFFFF
will find the end (assuming unicode strings here) since no second character can be larger than \uFFFF
. Even if it can't match the string exactly, the search will return the index of where the target would be even if it's not really there.
update: \uFFFF
is the largest possible sortable unicode character in 16 bit block, if you are working with 32 bit blocks use U+10FFFF
(whatever that is in Java). I've personally never worked 32 bit unicode blocks in Java. See the section 16.7 of the 5.2.0 spec.
U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF . U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF . This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on
The \uFFFF
sequence in Java denotes the character with Unicode codepoint U+FFFF. However, the codepoint does not encode a character at all:
U+FFFF is used to represent a numeric value that is guaranteed not to be a character, for uses such as the final value at the end of an index.
See these references: Unicode Technical Report #16, this Unicode character chart and this character definition.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With