I see the two being used (seemingly) interchangeably in many cases - are they the same or different? This also seems to depend on whether the language is talking about UTF-8 (e.g. Rust) or UTF-16 (e.g. Java/Haskell). Is the code point/scalar distinction somehow dependent on the encoding scheme?
Unicode scalar values are the 21-bit codes that are the basic unit of Unicode. Each scalar value is represented by a Unicode.Scalar instance and is equivalent to a UTF-32 code unit.
A code point is a number assigned to represent an abstract character in a system for representing text (such as Unicode). In Unicode, a code point is expressed in the form "U+1234" where "1234" is the assigned number. For example, the character "A" is assigned a code point of U+0041.
The Unicode.Scalar type, representing a single Unicode scalar value, is the element type of a string's unicodeScalars collection. You can create a Unicode.Scalar instance by using a string literal that contains a single character representing exactly one Unicode scalar value.
The maximum possible number of code points Unicode can support is 1,114,112, organized into seventeen planes of 65,536 code points each.
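Since this answer leans on Swift's Unicode.Scalar type, here is a minimal sketch using only standard-library APIs; the values in the comments are what I'd expect, assuming the "Café" literal is stored in precomposed (NFC) form:

    // A string literal containing exactly one scalar value can initialize Unicode.Scalar.
    let letterA: Unicode.Scalar = "A"
    print(letterA.value)                       // 65, i.e. U+0041

    // A string's unicodeScalars view exposes each scalar value it contains.
    for scalar in "Café".unicodeScalars {
        print(scalar, "U+" + String(scalar.value, radix: 16, uppercase: true))
    }
    // Prints C U+43, a U+61, f U+66, é U+E9 (one per line)

    // The codespace size: seventeen planes of 65,536 code points each.
    print(17 * 65_536 == 0x10FFFF + 1)         // true: 1,114,112 code points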
First let's look at definitions D9, D10 and D10a, Section 3.4, Characters and Encoding:
D9 Unicode codespace: A range of integers from 0 to 10FFFF₁₆.
D10 Code point: Any value in the Unicode codespace.
• A code point is also known as a code position.
...
D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
[emphasis added]
Okay, so code points are integers in a certain range. They are divided into categories called "code point types".
Now let's look at definition D76, Section 3.9, Unicode Encoding Forms:
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
Surrogates are defined and explained in Section 3.8, just before D76. The gist of the story is that surrogates are divided into two categories: high-surrogates and low-surrogates. These are used only by UTF-16 so that it can represent all code points with 16-bit code units. (There are 1,114,112 code points, but 2¹⁶ = 65,536 is far fewer than that, so code points beyond the first plane are encoded as a pair of surrogates.) UTF-8 doesn't have this problem; it is a variable-length encoding scheme (a code point is encoded as 1 to 4 bytes), so it can accommodate all the code points without using surrogates.
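To make that concrete, here is a small Swift sketch (standard library only) that takes U+1F600, a code point outside the Basic Multilingual Plane, and shows it as one scalar value, as a UTF-16 surrogate pair, and as four UTF-8 code units:

    let emoji = "😀"   // U+1F600 GRINNING FACE

    // One scalar value…
    print(emoji.unicodeScalars.map { $0.value })                        // [128512], i.e. 0x1F600

    // …encoded by UTF-16 as a high/low surrogate pair…
    print(emoji.utf16.map { String($0, radix: 16, uppercase: true) })   // ["D83D", "DE00"]

    // …and by UTF-8 as four bytes, with no surrogates involved.
    print(emoji.utf8.map { String($0, radix: 16, uppercase: true) })    // ["F0", "9F", "98", "80"]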
Summary: a code point is either a scalar or a surrogate. A code point is merely a number in the most abstract sense; how that number is encoded into binary form is a separate issue. UTF-16 uses surrogate pairs because it can't directly represent all possible code points. UTF-8 doesn't use surrogate pairs.
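As a quick check of that summary, Swift's failable Unicode.Scalar(_: UInt32) initializer accepts exactly the scalar values and rejects surrogate code points and anything above U+10FFFF; a minimal sketch:

    print(Unicode.Scalar(UInt32(0x0041)) as Any)     // Optional("A"): a scalar value
    print(Unicode.Scalar(UInt32(0xD800)) as Any)     // nil: a high-surrogate code point, not a scalar
    print(Unicode.Scalar(UInt32(0x110000)) as Any)   // nil: outside the codespace entirely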
In the future, you might find consulting the Unicode glossary helpful. It contains many of the frequently used definitions, as well as links to the definitions in the Unicode specification.