Here's an excerpt from the java.text.CharacterIterator documentation:
This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are used for iteration. They return DONE if [...], signaling that the iterator has reached the end of the sequence.
static final char DONE: Constant that is returned when the iterator has reached either the end or the beginning of the text. The value is \uFFFF, the "not a character" value which should not occur in any valid Unicode string.
The final clause (that \uFFFF "should not occur in any valid Unicode string") is what I'm having trouble understanding, because from my tests, it looks like a Java String can most certainly contain \uFFFF, and there doesn't seem to be any problem with it, except obviously with the prescribed CharacterIterator traversal idiom, which breaks because of a false positive (e.g. next() returns '\uFFFF' == DONE when it's not really "done").
Here's a snippet to illustrate the "problem" (see also on ideone.com):
    import java.text.*;

    public class CharacterIteratorTest {

        // this is the prescribed traversal idiom from the documentation
        public static void traverseForward(CharacterIterator iter) {
            for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
                System.out.print(c);
            }
        }

        public static void main(String[] args) {
            String s = "abc\uFFFFdef";

            System.out.println(s); // abc?def
            System.out.println(s.indexOf('\uFFFF')); // 3

            traverseForward(new StringCharacterIterator(s)); // abc
        }
    }
So what is going on here?
- [...] \uFFFF?
- Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?
- [...] \uFFFF?
- [...] String to contain \uFFFF anyway?
Unicode is an international character encoding standard capable of representing the majority of the world's written languages. It is not a 16-bit encoding: code points range from U+0000 to U+10FFFF. What is 16 bits wide is Java's char, a UTF-16 code unit whose lowest value is \u0000 and whose highest value is \uFFFF.
Unicode escapes of the form \uXXXX can be used anywhere in Java source code: in comments, identifiers, character literals, and string literals. Identifiers may also contain Unicode letters and digits directly.
Unicode is a computing industry standard designed to consistently and uniquely encode the characters used in written languages throughout the world. Code points are conventionally written in hexadecimal; for example, U+0041 (0x0041) represents the Latin capital letter A.
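The difference between a Unicode code point and a Java char can be checked directly. The following is only a small sketch; the class name CodePointDemo and the helper unitsFor are made up for illustration, while Character.charCount, Character.MAX_VALUE and Character.MAX_CODE_POINT are standard library members.

```java
final class CodePointDemo {

    /** Number of 16-bit char code units needed to store one code point. */
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // U+0041 (Latin capital letter A) fits in a single char.
        System.out.println(unitsFor(0x0041));

        // 0xFFFF is the largest value a single char can hold.
        System.out.println((int) Character.MAX_VALUE == 0xFFFF);

        // U+10FFFF, the last code point, needs a surrogate pair (two chars).
        System.out.println(unitsFor(Character.MAX_CODE_POINT));

        // A String stores the char value 0xFFFF like any other char.
        System.out.println("abc\uFFFFdef".length());
    }
}
```

So 0xFFFF is the upper bound of char, not of Unicode as a whole.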
\uFFFF is the format in which the Unicode value is written in the source I read it from (say, an ASCII file), not a literal character. I imagined there would be a more direct method, but this one should be fine too.
EDIT (2013-12-17): Peter O. brings up an excellent point below, which renders this answer wrong. Old answer below, for historical accuracy.
Answering your questions:
No. U+FFFF is a so-called non-character. From Section 16.7 of the Unicode Standard:
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data.
...
The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF.
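The arithmetic in that passage (34 + 32 = 66) can be verified by brute force. This is a sketch, not a library API: isNoncharacter and count are hypothetical helpers written directly from the ranges quoted above.

```java
final class NoncharacterCount {

    /** True if cp is one of the noncharacter code points listed in the quote. */
    static boolean isNoncharacter(int cp) {
        boolean inFdBlock = cp >= 0xFDD0 && cp <= 0xFDEF;  // contiguous BMP range
        boolean lastTwoOfPlane = (cp & 0xFFFE) == 0xFFFE;   // U+nFFFE and U+nFFFF
        return Character.isValidCodePoint(cp) && (inFdBlock || lastTwoOfPlane);
    }

    /** Counts noncharacters across the whole code point range. */
    static int count() {
        int n = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count()); // 66
    }
}
```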
Not quite. Applications are allowed to use those code points internally in any way they want. Quoting the standard again:
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.
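The "appropriate action" the quote mentions, replacing a received noncharacter with U+FFFD, might look roughly like this. It is a sketch under the assumption that the check happens at the point where text enters the application; NoncharacterSanitizer, isNoncharacter and sanitize are illustrative names, not part of the JDK.

```java
final class NoncharacterSanitizer {

    /** True if cp is a noncharacter: U+FDD0..U+FDEF or the last two code points of a plane. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of input with every noncharacter replaced by U+FFFD. */
    static String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .map(cp -> isNoncharacter(cp) ? 0xFFFD : cp)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String received = "abc\uFFFFdef";
        String cleaned = sanitize(received);
        // The noncharacter at index 3 has been replaced, not deleted.
        System.out.println(cleaned.length());
        System.out.println(cleaned.charAt(3) == '\uFFFD');
    }
}
```

Replacing rather than deleting keeps the position of the problem visible, in line with the quoted warning against silently dropping such code points.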
So while you should never encounter such a string from the user, another application, or a file, you may well put it into a Java String if you know what you're doing (though this basically means that you cannot use the CharacterIterator on that string).
As quoted above, any string used for interchange must not contain them. Within your application you're free to use them in whatever way you want.
Of course, a Java char, being just a 16-bit unsigned integer, doesn't really care about the value it holds either.
No. In fact, the section on noncharacters even suggests the use of U+FFFF as sentinel value:
In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.
U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
CharacterIterator follows this in that it returns U+FFFF when no more characters are available. Of course, this means that if you have another use for that code point in your application you may consider using a different non-character for that purpose since U+FFFF is already taken – at least if you're using CharacterIterator.
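If a string may legitimately hold U+FFFF internally, one way to still walk it with a CharacterIterator is to stop on the index rather than on the DONE sentinel. This is a sketch, not the idiom from the documentation; the class name IndexBasedTraversal is made up, while getIndex() and getEndIndex() are part of the CharacterIterator interface.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class IndexBasedTraversal {

    /** Collects every char of the iterator, including any char equal to DONE. */
    static String traverseForward(CharacterIterator iter) {
        StringBuilder seen = new StringBuilder();
        // Terminate on position, so a genuine '\uFFFF' in the text is not mistaken for DONE.
        for (char c = iter.first(); iter.getIndex() < iter.getEndIndex(); c = iter.next()) {
            seen.append(c);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";
        String walked = traverseForward(new StringCharacterIterator(s));
        System.out.println(walked.length());
        System.out.println(walked.equals(s));
    }
}
```

With StringCharacterIterator, next() moves the index to getEndIndex() once the last character has been passed, which is what ends the loop.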