How does UTF-8 encoding identify single byte and double byte characters?

Recently I ran into an issue with character encoding, and while digging into character sets and encodings this question came to mind. UTF-8 is the most popular encoding because of its backward compatibility with ASCII. Since UTF-8 is a variable-length encoding, how does it differentiate between single-byte and double-byte characters? For example, "Aݔ" is stored as "410754" (the Unicode codepoint for A is 41 and the one for the Arabic character is 0754). How does the encoding identify that 41 is one character and 0754 is another, two-byte character? Why is it not read as 4107 as one double-byte character and 54 as a single-byte character?

995

asked Jun 15 '17 11:06

Ganesh kumar S R

1 Answer

For example, "Aݔ" is stored as "410754"

That’s not how UTF-8 works.

Characters U+0000 through U+007F (aka ASCII) are stored as single bytes. They are the only characters whose codepoints numerically match their UTF-8 representation. For example, U+0041 becomes 0x41, which is 01000001 in binary.

All other characters are represented with multiple bytes. U+0080 through U+07FF use two bytes each, U+0800 through U+FFFF use three bytes each, and U+10000 through U+10FFFF use four bytes each.
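
The ranges above can be sketched as a small lookup. This is only an illustration: `utf8_length` is a hypothetical helper name, and Python's built-in `str.encode` is used purely to cross-check it.

```python
def utf8_length(codepoint: int) -> int:
    """Return how many bytes UTF-8 uses for a codepoint (hypothetical helper)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if codepoint <= 0x7F:
        return 1      # ASCII range
    if codepoint <= 0x7FF:
        return 2
    if codepoint <= 0xFFFF:
        return 3
    return 4

# Cross-check against Python's own encoder, one sample codepoint per range.
# (Surrogates U+D800..U+DFFF are deliberately avoided: they cannot be encoded.)
for cp in (0x41, 0x754, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```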

Computers know where one character ends and the next one starts because UTF-8 was designed so that the single-byte values used for ASCII do not overlap with those used in multi-byte sequences. The bytes 0x00 through 0x7F are only used for ASCII and nothing else; the bytes above 0x7F are only used for multi-byte sequences and nothing else. Furthermore, the bytes that are used at the beginning of the multi-byte sequences also cannot occur in any other position in those sequences.
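
As a rough sketch of that non-overlap property, each byte can be classified from its value alone, without looking at its neighbours. The function name `byte_role` is made up for this example, and the sketch does not reject the few byte values that never appear in valid UTF-8.

```python
def byte_role(b: int) -> str:
    """Classify a single byte by its value (hypothetical helper, no validation)."""
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: never starts a character
    return "lead"               # 11xxxxxx: starts a multi-byte sequence

data = "A\u0754".encode("utf-8")       # the string from the question
roles = [byte_role(b) for b in data]
print(roles)  # ['ascii', 'lead', 'continuation']
```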

To make this work, codepoints are not stored directly but are encoded using the following binary patterns:

2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The number of leading ones in the first byte tells you how many bytes the sequence has in total, so one fewer than that number of following bytes still belong to the same character. All of those following bytes start with 10 in binary. To encode a character, you convert its codepoint to binary and fill in the x’s, padding with leading zeros if necessary.
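
A decoder can use the first byte of each sequence to find character boundaries. The following is a minimal sketch, assuming well-formed input; `sequence_length` and `split_characters` are hypothetical names, and the leading-ones count is read here as the total length of the sequence.

```python
def sequence_length(first_byte: int) -> int:
    """Total number of bytes in the sequence that starts with first_byte."""
    if first_byte & 0b10000000 == 0:
        return 1                                  # 0xxxxxxx
    if first_byte & 0b11100000 == 0b11000000:
        return 2                                  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3                                  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4                                  # 11110xxx
    raise ValueError("continuation or invalid byte cannot start a sequence")

def split_characters(data: bytes) -> list:
    """Cut a UTF-8 byte string into one chunk per character (no validation)."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

print(split_characters("A\u0754".encode("utf-8")))  # [b'A', b'\xdd\x94']
```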

As an example: U+0754 is between U+0080 and U+07FF, so it needs two bytes. 0x0754 in binary is 11101010100, so you replace the x’s with those digits:

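11011101 10010100

In hexadecimal that is 0xDD 0x94, so "Aݔ" is actually stored as the three bytes 41 DD 94, not 41 07 54.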
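
The same bit-filling can be reproduced with shifts and masks. This is only a sketch of the two-byte case; `encode_two_bytes` is a hypothetical helper, checked against Python's built-in encoder.

```python
def encode_two_bytes(codepoint: int) -> bytes:
    """Encode a codepoint in U+0080..U+07FF by hand (hypothetical helper)."""
    if not 0x80 <= codepoint <= 0x7FF:
        raise ValueError("codepoint does not fit the two-byte pattern")
    first = 0b11000000 | (codepoint >> 6)            # 110xxxxx: top 5 bits
    second = 0b10000000 | (codepoint & 0b00111111)   # 10xxxxxx: low 6 bits
    return bytes([first, second])

encoded = encode_two_bytes(0x0754)
print(encoded.hex())                      # dd94
assert encoded == "\u0754".encode("utf-8")

# The full string from the question: one ASCII byte, then the two-byte sequence.
print("A\u0754".encode("utf-8").hex())    # 41dd94
```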

187

answered Oct 04 '22 02:10

CharlotteBuff

Tags:

character-encoding

encoding

unicode

utf-8
