I see the two being used (seemingly) interchangeably in many cases - are they the same or different? This also seems to depend on whether the language is talking about UTF-8 (e.g. Rust) or UTF-16 (e.g. Java/Haskell). Is the code point/scalar distinction somehow dependent on the encoding scheme?
Unicode scalar values are the 21-bit codes that are the basic unit of Unicode. Each scalar value is represented by a Unicode.Scalar instance and is equivalent to a UTF-32 code unit.
A code point is a number assigned to represent an abstract character in a system for representing text (such as Unicode). In Unicode, a code point is expressed in the form "U+1234" where "1234" is the assigned number. For example, the character "A" is assigned a code point of U+0041.
The Unicode.Scalar type, representing a single Unicode scalar value, is the element type of a string's unicodeScalars collection. You can create a Unicode.Scalar instance by using a string literal that contains a single character representing exactly one Unicode scalar value.
The maximum possible number of code points Unicode can support is 1,114,112, organized into seventeen planes of 65,536 code points each.
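Since this answer leans on Swift's Unicode.Scalar type, here is a minimal sketch using only standard-library APIs; the values in the comments are what I'd expect, assuming the "Café" literal is stored in precomposed (NFC) form:

    // A string literal containing exactly one scalar value can initialize Unicode.Scalar.
    let letterA: Unicode.Scalar = "A"
    print(letterA.value)                       // 65, i.e. U+0041

    // A string's unicodeScalars view exposes each scalar value it contains.
    for scalar in "Café".unicodeScalars {
        print(scalar, "U+" + String(scalar.value, radix: 16, uppercase: true))
    }
    // Prints C U+43, a U+61, f U+66, é U+E9 (one per line)

    // The codespace size: seventeen planes of 65,536 code points each.
    print(17 * 65_536 == 0x10FFFF + 1)         // true: 1,114,112 code points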
First let's look at definitions D9, D10 and D10a, Section 3.4, Characters and Encoding:
D9 Unicode codespace: A range of integers from 0 to 10FFFF₁₆.
D10 Code point: Any value in the Unicode codespace.
• A code point is also known as a code position.
...
D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
[emphasis added]
Okay, so code points are integers in a certain range. They are divided into categories called "code point types".
Now let's look at definition D76, Section 3.9, Unicode Encoding Forms:
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
Surrogates are defined and explained in Section 3.8, just before D76. The gist of the story is that surrogates are divided into two categories: high-surrogates and low-surrogates. These are used only by UTF-16 so that it can represent all code points with 16-bit code units. (There are 1,114,112 code points, but 2¹⁶ = 65,536 is far fewer than that, so code points beyond the first plane are encoded as a pair of surrogates.) UTF-8 doesn't have this problem; it is a variable-length encoding scheme (a code point is encoded as 1 to 4 bytes), so it can accommodate all the code points without using surrogates.
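To make that concrete, here is a small Swift sketch (standard library only) that takes U+1F600, a code point outside the Basic Multilingual Plane, and shows it as one scalar value, as a UTF-16 surrogate pair, and as four UTF-8 code units:

    let emoji = "😀"   // U+1F600 GRINNING FACE

    // One scalar value…
    print(emoji.unicodeScalars.map { $0.value })                        // [128512], i.e. 0x1F600

    // …encoded by UTF-16 as a high/low surrogate pair…
    print(emoji.utf16.map { String($0, radix: 16, uppercase: true) })   // ["D83D", "DE00"]

    // …and by UTF-8 as four bytes, with no surrogates involved.
    print(emoji.utf8.map { String($0, radix: 16, uppercase: true) })    // ["F0", "9F", "98", "80"]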
Summary: a code point is either a scalar or a surrogate. A code point is merely a number in the most abstract sense; how that number is encoded into binary form is a separate issue. UTF-16 uses surrogate pairs because it can't directly represent all possible code points. UTF-8 doesn't use surrogate pairs.
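As a quick check of that summary, Swift's failable Unicode.Scalar(_: UInt32) initializer accepts exactly the scalar values and rejects surrogate code points and anything above U+10FFFF; a minimal sketch:

    print(Unicode.Scalar(UInt32(0x0041)) as Any)     // Optional("A"): a scalar value
    print(Unicode.Scalar(UInt32(0xD800)) as Any)     // nil: a high-surrogate code point, not a scalar
    print(Unicode.Scalar(UInt32(0x110000)) as Any)   // nil: outside the codespace entirely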
In the future, you might find consulting the Unicode glossary helpful. It contains many of the frequently used definitions, as well as links to the definitions in the Unicode specification.