What is the difference between Unicode code points and Unicode scalars?


I see the two being used (seemingly) interchangeably in many cases - are they the same or different? This also seems to depend on whether the language is talking about UTF-8 (e.g. Rust) or UTF-16 (e.g. Java/Haskell). Is the code point/scalar distinction somehow dependent on the encoding scheme?

asked Jan 26 '18 by typesanitizer


People also ask

What is a Unicode scalar?

Unicode scalar values are the 21-bit codes that are the basic unit of Unicode. Each scalar value is represented by a Unicode.Scalar instance and is equivalent to a UTF-32 code unit.

What is a code point in Unicode?

A code point is a number assigned to represent an abstract character in a system for representing text (such as Unicode). In Unicode, a code point is expressed in the form "U+1234", where "1234" is the assigned number in hexadecimal. For example, the character "A" is assigned the code point U+0041.
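For instance (a minimal sketch in Rust, one of the languages the question mentions), casting a char to an integer exposes its code point number:

```rust
fn main() {
    // A Rust `char` holds one Unicode scalar value; casting it to u32
    // yields the code point number the standard assigns to it.
    let c = 'A';
    println!("U+{:04X}", c as u32); // prints: U+0041
}
```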

What is Unicode scalar in Swift?

The Unicode.Scalar type, representing a single Unicode scalar value, is the element type of a string's unicodeScalars collection. You can create a Unicode.Scalar instance by using a string literal that contains a single character representing exactly one Unicode scalar value.
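For comparison, and keeping all examples in one language: Rust's char plays a role analogous to Swift's Unicode.Scalar, and iterating a string's chars() yields one scalar value per element, much like Swift's unicodeScalars view. A rough sketch:

```rust
fn main() {
    // Each element of chars() is a single Unicode scalar value,
    // analogous to Swift's `unicodeScalars` collection.
    // (Assumes the "é" below is the precomposed form, U+00E9.)
    for scalar in "héllo".chars() {
        print!("U+{:04X} ", scalar as u32);
    }
    println!(); // U+0068 U+00E9 U+006C U+006C U+006F
}
```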

How many Unicode code points are there?

The maximum number of code points Unicode can support is 1,114,112, spread across seventeen planes. Each plane holds 65,536 (2¹⁶) code points, and 17 × 65,536 = 1,114,112.
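As a quick sanity check on that arithmetic, a tiny Rust sketch:

```rust
fn main() {
    // 17 planes x 65,536 (2^16) code points per plane.
    let per_plane: u32 = 1 << 16; // 65,536
    let total = 17 * per_plane;
    assert_eq!(total, 1_114_112);
    println!("{total} code points in the codespace");
}
```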


1 Answer

First, let's look at definitions D9, D10 and D10a in Section 3.4, Characters and Encoding:

D9 Unicode codespace: A range of integers from 0 to 10FFFF₁₆.

D10 Code point: Any value in the Unicode codespace.

• A code point is also known as a code position.

...

D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

[emphasis added]

Okay, so code points are integers in a certain range. They are divided into categories called "code point types".

Now let's look at definition D76, Section 3.9, Unicode Encoding Forms:

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
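Rust's char type admits exactly the scalar values, so char::from_u32 is a handy way to see those two ranges in practice (a sketch, not part of the quoted standard):

```rust
fn main() {
    // char::from_u32 returns Some(_) only for Unicode scalar values,
    // i.e. 0..=0xD7FF and 0xE000..=0x10FFFF.
    assert!(char::from_u32(0xD7FF).is_some());    // last scalar before the surrogates
    assert!(char::from_u32(0xD800).is_none());    // first high-surrogate: not a scalar
    assert!(char::from_u32(0xDFFF).is_none());    // last low-surrogate: not a scalar
    assert!(char::from_u32(0xE000).is_some());    // scalar values resume here
    assert!(char::from_u32(0x10FFFF).is_some());  // last code point in the codespace
    assert!(char::from_u32(0x11_0000).is_none()); // outside the codespace entirely
    println!("scalar-value ranges check out");
}
```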

Surrogates are defined and explained in Section 3.8, just before D76. The gist of the story is that surrogates are divided into two categories: high-surrogates and low-surrogates. These are used only by UTF-16 so that it can represent all code points. (There are 1,114,112 code points, but 2¹⁶ = 65,536 is much less than that.) UTF-8 doesn't have this problem; it is a variable-length encoding scheme (a code point is encoded in 1 to 4 bytes), so it can accommodate all the code points without using surrogates.
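To make that concrete, here is a small Rust sketch encoding one code point from outside the Basic Multilingual Plane both ways:

```rust
fn main() {
    let s = "😀"; // U+1F600, above U+FFFF
    // UTF-16 must split it into a high/low surrogate pair...
    let utf16: Vec<u16> = s.encode_utf16().collect();
    println!("{utf16:04X?}"); // [D83D, DE00]
    // ...while UTF-8 spends four bytes and needs no surrogates.
    println!("{:02X?}", s.as_bytes()); // [F0, 9F, 98, 80]
    // Either way, it is one code point and one scalar value.
    println!("U+{:X}", s.chars().next().unwrap() as u32); // U+1F600
}
```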

Summary: a code point is either a scalar value or a surrogate. A code point is merely a number in the most abstract sense; how that number is encoded into binary form is a separate issue. UTF-16 uses surrogate pairs because a single 16-bit code unit cannot represent every code point. UTF-8 doesn't use surrogate pairs.

In the future, you might find consulting the Unicode glossary helpful. It contains many of the frequently used definitions, as well as links to the definitions in the Unicode specification.

answered Sep 19 '22 by typesanitizer