 

UTF-8 & Unicode, what's with 0xC0 and 0x80?

Tags:

unicode

utf-8

I've been reading about Unicode and UTF-8 for the last couple of days, and I often come across a bitwise comparison similar to this:

int strlen_utf8(char *s)
{
    int i = 0, j = 0;
    while (s[i])
    {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}

Can someone clarify the comparison with 0xc0? Is it checking the most significant bit?

Thank you!

EDIT: ANDed, not comparison, used the wrong word ;)

vdsf asked Oct 12 '10


People also ask

What UTF-8 means?

UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.

What is the UTF-8 Value for?

UTF-8 (Unicode Transformation Format, 8-bit) is an encoding defined by the Unicode Standard and by ISO/IEC 10646. Its sequences of up to four bytes carry up to 21 payload bits (3 + 6 + 6 + 6), so it can represent up to 2,097,152 code points (2^21), more than enough to cover the 1,112,064 valid Unicode code points.

Is UTF-8 and ASCII same?

For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is byte-for-byte identical to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 4 bytes, though most Western European characters require only 2 bytes.
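
A tiny sketch of that compatibility (the string "Aé" is just an example, and its bytes are hard-coded so the snippet does not depend on the source file's encoding):

#include <stdio.h>

int main(void)
{
    /* "Aé" as UTF-8 bytes: 'A' (U+0041) stays the single ASCII byte 0x41,
     * while 'é' (U+00E9) becomes the two-byte sequence 0xC3 0xA9. */
    const unsigned char s[] = { 0x41, 0xC3, 0xA9, 0x00 };

    for (int i = 0; s[i]; i++)
        printf("%02X ", s[i]);   /* prints: 41 C3 A9 */
    printf("\n");
    return 0;
}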

Is UTF-8 or UTF-16 better?

UTF-16 is more compact where ASCII is not predominant, since it uses 2 bytes for most characters. UTF-8 needs 3 or more bytes for those higher code points, whereas UTF-16 stays at 2 bytes for everything in the Basic Multilingual Plane.
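
For a concrete (if narrow) illustration of that trade-off, here is a sketch using one hard-coded CJK code point, U+4E2D, chosen only as an example:

#include <stdio.h>

int main(void)
{
    /* U+4E2D encoded by hand:
     *   UTF-8:  E4 B8 AD            -> 3 bytes
     *   UTF-16: 4E 2D (big-endian)  -> 2 bytes */
    const unsigned char utf8 [] = { 0xE4, 0xB8, 0xAD };
    const unsigned char utf16[] = { 0x4E, 0x2D };

    printf("UTF-8:  %zu bytes\n", sizeof utf8);   /* 3 */
    printf("UTF-16: %zu bytes\n", sizeof utf16);  /* 2 */
    return 0;
}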


1 Answer

It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.

The bit mask 0xc0 is 11 00 00 00 so what the AND is doing is extracting only the top two bits:

    ab cd ef gh
AND 11 00 00 00
    -- -- -- --
  = ab 00 00 00

This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking to see if the top two bits of the value are not equal to 10.
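
To make that concrete, here is a small sketch (the three sample bytes, the ASCII letter 0x41, the lead byte 0xE2 and the continuation byte 0x82, are chosen purely for illustration):

#include <stdio.h>

int main(void)
{
    /* Sample bytes: ASCII 'A', a 3-byte-sequence lead byte, a continuation byte. */
    unsigned char bytes[] = { 0x41, 0xE2, 0x82 };

    for (int i = 0; i < 3; i++) {
        unsigned char top2 = bytes[i] & 0xC0;   /* keep only the top two bits */
        printf("0x%02X & 0xC0 = 0x%02X -> %s\n", bytes[i], top2,
               top2 == 0x80 ? "continuation byte (not counted)"
                            : "start of a character (counted)");
    }
    return 0;
}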

"Why?", I hear you ask. Well, that's a good question. The answer is that, in UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:

                   UTF-8
Range              Encoding  Binary value
-----------------  --------  --------------------------
U+000000-U+00007f  0xxxxxxx  0xxxxxxx

U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                   10xxxxxx

U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                   10yyyyxx
                   10xxxxxx

U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                   10zzyyyy
                   10yyyyxx
                   10xxxxxx

So, what this little snippet is doing is going through every byte of your UTF-8 string and counting up all the bytes that aren't continuation bytes (i.e., it's getting the length of the string, as advertised). See the Wikipedia article on UTF-8 for more detail and Joel Spolsky's excellent article on Unicode for a primer.
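
As a quick usage example (the test string "héllo" is just an assumption for illustration, spelled out as explicit bytes so the snippet does not depend on the source file's encoding):

#include <stdio.h>
#include <string.h>

/* Same function as in the question: counts non-continuation bytes. */
int strlen_utf8(char *s)
{
    int i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}

int main(void)
{
    /* "héllo" spelled out in UTF-8 bytes: 'é' (U+00E9) is 0xC3 0xA9. */
    char s[] = { 'h', (char)0xC3, (char)0xA9, 'l', 'l', 'o', 0 };

    printf("bytes:      %zu\n", strlen(s));       /* 6 */
    printf("characters: %d\n",  strlen_utf8(s));  /* 5 */
    return 0;
}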


As an interesting aside, you can classify the bytes in a UTF-8 stream as follows:

  • With the high bit set to 0, it's a single byte value.
  • With the two high bits set to 10, it's a continuation byte.
  • Otherwise, it's the first byte of a multi-byte sequence, and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc.).
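
A minimal sketch of that classification, assuming a helper of my own devising (utf8_seq_len is not from the question or the answer above):

/* Classify a byte: 1 for an ASCII byte, 0 for a continuation byte,
 * or 2..4 for the first byte of a 2-, 3- or 4-byte sequence.
 * (Name and return convention are illustrative only.) */
int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: single-byte (ASCII) value   */
    if ((b & 0xC0) == 0x80) return 0;  /* 10xxxxxx: continuation byte           */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: first of a 2-byte sequence  */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: first of a 3-byte sequence  */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: first of a 4-byte sequence  */
    return -1;                         /* not valid as a UTF-8 lead byte        */
}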
paxdiablo answered Sep 23 '22