Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a "surrogate pair" in Java?

People also ask

What is surrogate pair in Unicode?

With surrogate pairs, a Unicode code point from range U+D800 to U+DBFF (called "high surrogate") gets combined with another Unicode code point from range U+DC00 to U+DFFF (called "low surrogate") to generate a whole new character, allowing the encoding of over one million additional characters.

What is surrogate character?

These characters have some special values; they are made up of two Unicode characters in two specific ranges such that the first Unicode character is in one range (for example 0xD800-0xD8FF) and the second Unicode character is in the second range (for example 0xDC00-0xDCFF). This is called a surrogate pair.

What is surrogate code?

Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D80016 to DBFF16, and trailing, or low, surrogates are from DC0016 to DFFF16.

What is Unicode in Javascript?

Unicode in Javascript source code In Javascript, the identifiers and string literals can be expressed in Unicode via a Unicode escape sequence. The general syntax is \uXXXX , where X denotes four hexadecimal digits. For example, the letter o is denoted as '\u006F' in Unicode.


The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.

Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

The surrogate code units are in two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or end of the two-code-unit sequence.


Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values less than 65,535 (0xFFFF) and could be represented in 16 bits. Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in Unicode version 3.1, 32-bit values — called code points — were adopted for the UTF-32 encoding scheme. But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow for the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates(in the range U+D800 to U+DBFF) and another 1,024 values to 16-bit low surrogates(in the range U+DC00 to U+DFFF). It uses a high surrogate followed by a low surrogate — a surrogate pair — to represent (the product of 1,024 and 1,024)1,048,576 (0x100000) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF) .


Adding some more info to the above answers from this post.

Tested in Java-12, should work in all Java versions above 5.

As mentioned here: https://stackoverflow.com/a/47505451/2987755,
whichever character (whose Unicode is above U+FFFF) is represented as a surrogate pair, which Java stores as a pair of char values, i.e. the single Unicode character is represented as two adjacent Java characters.
As we can see in the following example.
1. Length:

"🌉".length()  //2, Expectations was it should return 1

"🌉".codePointCount(0,"🌉".length())  //1, To get the number of Unicode characters in a Java String  

2. Equality:
Represent "🌉" to String using Unicode \ud83c\udf09 as below and check equality.

"🌉".equals("\ud83c\udf09") // true

Java does not support UTF-32

"🌉".equals("\u1F309") // false  

3. You can convert Unicode character to Java String

"🌉".equals(new String(Character.toChars(0x0001F309))) //true

4. String.substring() does not consider supplementary characters

"🌉🌐".substring(0,1) //"?"
"🌉🌐".substring(0,2) //"🌉"
"🌉🌐".substring(0,4) //"🌉🌐"

To solve this we can use String.offsetByCodePoints(int index, int codePointOffset)

"🌉🌐".substring(0,"🌉🌐".offsetByCodePoints(0,1) // "🌉"
"🌉🌐".substring(2,"🌉🌐".offsetByCodePoints(1,2)) // "🌐"

5. Iterating Unicode string with BreakIterator
6. Sorting Strings with Unicode java.text.Collator
7. Character's toUpperCase(), toLowerCase(), methods should not be used, instead, use String uppercase and lowercase of particular locale.
8. Character.isLetter(char ch) does not support, better used Character.isLetter(int codePoint), for each methodName(char ch) method in the Character class there will be type of methodName(int codePoint) which can handle supplementary characters.
9. Specify charset in String.getBytes(), converting from Bytes to String, InputStreamReader, OutputStreamWriter

Ref:
https://coolsymbol.com/emojis/emoji-for-copy-and-paste.html#objects
https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
https://www.ibm.com/developerworks/library/j-unicode/index.html
https://www.oracle.com/technetwork/articles/javaee/supplementary-142654.html

More info on example image1 image2
Other terms worth to explore: Normalization, BiDi


What that documentation is saying is that invalid UTF-16 strings may become valid after calling the reverse method since they might be the reverses of valid strings. A surrogate pair (discussed here) is a pair of 16-bit values in UTF-16 that encode a single Unicode code point; the low and high surrogates are the two halves of that encoding.