How many characters can be mapped with Unicode?

Tags: unicode, utf-8, utf

I am asking for the count of all possible valid combinations in Unicode, with an explanation. I know a character can be encoded as 1, 2, 3, or 4 bytes. I also don't understand why continuation bytes have restrictions, even though the starting byte of a character already makes clear how long it should be.

asked May 07 '11 21:05

Ufuk Hacıoğulları

1 Answer

I am asking for the count of all possible valid combinations in Unicode, with an explanation.

1,111,998: 17 planes × 65,536 code points per plane - 2,048 surrogates - 66 noncharacters
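The arithmetic can be checked by brute force. This is only a sketch: the helper names `is_surrogate` and `is_noncharacter` are made up for illustration, and the ranges used (U+D800..U+DFFF for surrogates, U+FDD0..U+FDEF plus the last two code points of every plane for noncharacters) are the ones the Unicode Standard defines.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536


def is_surrogate(cp):
    # U+D800..U+DFFF are reserved for UTF-16 surrogate pairs.
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF at the end of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


total = PLANES * CODE_POINTS_PER_PLANE
surrogates = sum(is_surrogate(cp) for cp in range(total))
noncharacters = sum(is_noncharacter(cp) for cp in range(total))
valid = total - surrogates - noncharacters

print(total, surrogates, noncharacters, valid)
```

The four printed values should be 1114112, 2048, 66 and 1111998, matching the figures above.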

Note that UTF-8 and UTF-32 could theoretically encode many more than 17 planes, but the code space is capped at U+10FFFF because that is the highest code point UTF-16 can represent with a surrogate pair.
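To see where the UTF-16 limit comes from, here is a rough sketch of the surrogate-pair arithmetic. The function name `decode_surrogate_pair` is hypothetical; the formula itself is the standard UTF-16 one, in which each surrogate contributes 10 bits on top of an offset of 0x10000.

```python
def decode_surrogate_pair(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair gives the largest code point UTF-16 can reach.
highest = decode_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(highest))                 # 0x10ffff
print((highest + 1) // 0x10000)     # 17 planes
```

So 16 supplementary planes reachable through surrogate pairs, plus the Basic Multilingual Plane, give the 17 planes in the count.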

Of these, 137,929 code points are actually assigned to characters as of Unicode 12.1.

I also don't understand why continuation bytes have restrictions, even though the starting byte of a character already makes clear how long it should be.

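The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing: any single byte tells you whether it starts a character or continues one, without looking at the bytes before it.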

For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
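The false positive is easy to reproduce with Python's built-in `gb18030` codec. This is a small sketch using a naive byte-level search, which is what an encoding-unaware search function effectively does.

```python
encoded = "ß".encode("gb18030")
needle = "8".encode("gb18030")   # the single byte 0x38

print(encoded.hex())             # 81308938
print(needle in encoded)         # True: byte-level search "finds" an 8 inside ß
print("8" in "ß")                # False: at the character level there is no 8
```

The byte-level match is the last byte of the four-byte sequence, which happens to equal the ASCII code for the digit 8.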

In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
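The same property can be sketched in a few lines. The helpers `is_continuation` and `char_start` are illustrative names, not part of any library; they rely only on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, which no ASCII byte or lead byte shares.

```python
def is_continuation(byte):
    # Continuation bytes are 0x80..0xBF (bit pattern 10xxxxxx).
    return byte & 0xC0 == 0x80


def char_start(data, i):
    """Step back from an arbitrary byte offset to the start of its character."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


data = "aß€😀8".encode("utf-8")   # characters of 1, 2, 3, 4 and 1 bytes

# Landing in the middle of a character, we can always resynchronize.
print(char_start(data, 8))        # 6: start of the 4-byte character
print(char_start(data, 5))        # 3: start of the 3-byte character

# A byte-level search for the digit 8 only matches the real digit.
print(data.find(b"8"))            # 10
print(data.count(b"8"))           # 1
```

Because every byte of a multi-byte character is 0x80 or above, an ASCII needle such as `b"8"` can never match inside one.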

answered Sep 22 '22 05:09

dan04
