I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.
I've found some information about the differences between characters and code points, characters being what is displayed for human users, and code points being a value encoding that specific character, but I have a no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?
In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.
Codepoint(1): Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF16. In Java, a character ( char ) is an unsigned 16 bit value; i.e 0 to FFFF. As you can see, there are more Unicode codepoints that can be represented as Java characters.
To support supplementary characters without changing the char primitive data type and causing incompatibility with previous Java programs, supplementary characters are defined by a pair of code point values that are called surrogates.
Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D80016 to DBFF16, and trailing, or low, surrogates are from DC0016 to DFFF16.
These characters have some special values; they are made up of two Unicode characters in two specific ranges such that the first Unicode character is in one range (for example 0xD800-0xD8FF) and the second Unicode character is in the second range (for example 0xDC00-0xDCFF). This is called a surrogate pair.
To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.
A Code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109384 symbols, that's way more than 216.
Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
When you try to use an encoding which uses less bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.
Thus, Surrogates are 16-bit values that indicate symbols that do not fit into a single two-byte value.
Java uses UTF-16 internally to represent text.
In particular, a char
(character) is an unsigned two-byte value that contains a UTF-16 value.
If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With