Java - what are characters, code points and surrogates? What difference is there between them?

I'm trying to find an explanation of the terms "character", "code point" and "surrogate", and while these terms aren't limited to Java, if there are any language-specific differences I'd like the explanation as it relates to Java.

I've found some information about the differences between characters and code points, characters being what is displayed for human users, and code points being a value encoding that specific character, but I have no idea about surrogates. What are surrogates, and how are they different from characters and code points? Do I have the right definitions for characters and code points?

In another thread about stepping through a string as an array of characters, the specific comment that prompted this question was "Note that this technique gives you characters, not code points, meaning you may get surrogates." I didn't really understand, and rather than create a long series of comments on a 5-year-old question I thought it would be best to ask for clarification in a new question.

Alium Britt asked Jun 01 '14 12:06


People also ask

What are code points Java?

Code point: Any value in the Unicode codespace; that is, the range of integers from 0 to 0x10FFFF. In Java, a character (char) is an unsigned 16-bit value, i.e. 0 to 0xFFFF. As you can see, there are more Unicode code points than can be represented as single Java characters.

What is surrogate in Java?

To support supplementary characters without changing the char primitive data type and causing incompatibility with previous Java programs, supplementary characters are defined by a pair of code point values that are called surrogates.

What is surrogate code points?

Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from 0xD800 to 0xDBFF, and trailing, or low, surrogates are from 0xDC00 to 0xDFFF.

What is surrogate character?

These characters have some special values; they are made up of two 16-bit values in two specific ranges, such that the first value is in one range (0xD800-0xDBFF) and the second value is in the second range (0xDC00-0xDFFF). This is called a surrogate pair.


1 Answer

To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109384 symbols, which is far more than 2^16 (65,536).

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
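To make the difference between the encodings concrete, here is a small sketch that counts how many bytes each one needs for the same symbol. The class and method names are made up for illustration, and it assumes the running JDK provides a charset named "UTF-32BE" (it is not one of the StandardCharsets constants).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Returns how many bytes the given charset needs to encode the text.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: the JDK in use ships a UTF-32BE charset; it is looked up by name.
        Charset utf32 = Charset.forName("UTF-32BE");

        String[] samples = {
            "A",                                    // U+0041, also an ASCII symbol
            "\u20AC",                               // U+20AC EURO SIGN
            new String(Character.toChars(0x1F600))  // U+1F600, a code point above U+FFFF
        };
        for (String s : samples) {
            // The BE (big-endian) variants are used so that no byte order mark is counted.
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE),
                    encodedLength(s, utf32));
        }
    }
}
```

UTF-32 always uses four bytes per code point, while UTF-8 and UTF-16 vary: "A" takes 1 and 2 bytes, the euro sign 3 and 2, and U+1F600 takes 4 in both.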

When you try to use an encoding whose units have fewer bits than are needed to represent all possible values (such as UTF-16, which uses 16-bit units), you need some workaround.

Thus, surrogates are 16-bit values that are combined in pairs to represent symbols that do not fit into a single two-byte value.
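As a rough illustration of how that workaround operates in UTF-16, the sketch below splits a code point above U+FFFF into a high and a low surrogate by hand and then puts it back together. The class and method names are hypothetical; the JDK offers the same conversions as Character.highSurrogate, Character.lowSurrogate and Character.toCodePoint.

```java
public class SurrogateMath {
    // Splits a supplementary code point (U+10000..U+10FFFF) into two UTF-16 surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits -> high surrogate
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits -> low surrogate
        return new char[] { high, low };
    }

    // Reverses toSurrogates: combines a high and a low surrogate into one code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600;
        char[] pair = toSurrogates(codePoint);
        System.out.printf("U+%X -> %04X %04X%n", codePoint, (int) pair[0], (int) pair[1]);

        // Cross-check the manual arithmetic against the built-in helpers.
        System.out.println(pair[0] == Character.highSurrogate(codePoint));
        System.out.println(pair[1] == Character.lowSurrogate(codePoint));
        System.out.println(fromSurrogates(pair[0], pair[1]) == Character.toCodePoint(pair[0], pair[1]));
    }
}
```

For U+1F600 this yields the pair D83D DE00. Neither half means anything on its own, which is why a lone surrogate is not a valid symbol.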

Java uses UTF-16 internally to represent text.

In particular, a char (character) is an unsigned two-byte value that contains a single UTF-16 code unit, so one char may be only half of a surrogate pair.
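This is what the comment quoted in the question is getting at: stepping through a String char by char can hand you surrogates rather than whole symbols. A minimal sketch of the difference, with hypothetical class and method names and assuming Java 8 or later for String.codePoints():

```java
public class CharsVersusCodePoints {
    // Returns the code points of the string; a surrogate pair becomes one value.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    // Counts how many char values in the string are surrogates.
    static int surrogateCount(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (stored as a surrogate pair), then "b".
        String s = "a" + new String(Character.toChars(0x1F600)) + "b";

        System.out.println("chars:       " + s.length());                      // counts UTF-16 code units
        System.out.println("code points: " + s.codePointCount(0, s.length())); // counts whole symbols
        System.out.println("surrogates:  " + surrogateCount(s));

        // Walking by char exposes the two halves of the pair separately.
        for (char c : s.toCharArray()) {
            System.out.printf("char       %04X%n", (int) c);
        }
        // Walking by code point keeps the pair together.
        for (int cp : codePointsOf(s)) {
            System.out.printf("code point %X%n", cp);
        }
    }
}
```

Here the string has 4 chars but only 3 code points. Code that must treat every symbol as a unit is usually better off iterating code points than chars.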

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2

Cephalopod answered Oct 04 '22 10:10