Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How are surrogate pairs calculated?

Tags:

unicode

utf-8

If a unicode uses 17 bits codepoints, how the surrogate pairs is calculated from code points?

like image 561
user496949 Avatar asked Dec 10 '22 03:12

user496949


1 Answers

Unicode code points are scalar values which range from 0x000000 to 0x10FFFF. Thus they are are 21 bit integers, not 17 bit.

Surrogate pairs are a mechanism of the UTF-16 form. This represents the 21-bit scalar values as one or two 16-bit code units.

  • Scalar values from 0x000000 to 0x00FFFF are represented as a single 16-bit code unit, from 0x0000 to 0xFFFF.
  • Scalar values from 0x00D800 to 0x00DFFF are not characters in Unicode, and so they will never occur in a Unicode character string.
  • Scalar values from 0x010000 to 0x10FFFF are represented as two 16-bit code units. The first code unit encodes the upper 11 bits of the scalar value, as a code unit ranging from 0xD800-0xDBFF. There's a bit of trickiness to encode values from 0x01-0x10 in four bits. The second code unit encodes the lower 10 bits of the scalar value, as a code unit ranging from 0xDC00-0xDFFF.

This is explained in detail, with sample code, in the Unicode consortium's FAQ, UTF-8, UTF-16, UTF-32 & BOM. That FAQ refers to the section of the Unicode Standard which has even more detail.

like image 134
Jim DeLaHunt Avatar answered Jun 01 '23 20:06

Jim DeLaHunt