
Test if char* string contains multibyte characters

I receive a byte buffer from a TCP server that may contain multibyte sequences encoding Unicode characters. Is there always a BOM I can check to detect those characters, and if not, how would you detect them?

cpx asked Feb 16 '11 05:02




1 Answer

If you know that the data is UTF-8, then you just have to check the high bit:

  • 0xxxxxxx = single-byte ASCII character
  • 1xxxxxxx = part of multi-byte character
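
The high-bit check above can be sketched in C roughly as follows. This is a minimal illustration that assumes the buffer really is UTF-8 and that its length is known; the function name `contains_multibyte` is hypothetical, not something from the question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if any byte in buf[0..len) has its
 * high bit set, i.e. the buffer holds at least one byte belonging to a
 * multi-byte UTF-8 sequence. Assumes the data is UTF-8. */
bool contains_multibyte(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* Cast to unsigned char: plain char may be signed. */
        if ((unsigned char)buf[i] & 0x80) {
            return true;
        }
    }
    return false;
}
```

Passing the length explicitly, rather than relying on a terminating null, suits data read from a socket, where the buffer is not necessarily null-terminated.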

Or, if you need to distinguish lead/trail bytes:

  • 10xxxxxx = 2nd, 3rd, or 4th byte of multi-byte character
  • 110xxxxx = 1st byte of 2-byte character
  • 1110xxxx = 1st byte of 3-byte character
  • 11110xxx = 1st byte of 4-byte character
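
As a rough sketch, the bit patterns listed above map onto mask-and-compare tests like the ones below. The enum and function names are hypothetical, and the classifier looks only at the leading bits of each byte: it does not check that a lead byte is followed by the right number of trail bytes, and it does not reject lead-pattern bytes that never occur in well-formed UTF-8 (such as 0xC0 or 0xC1).

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum utf8_byte_kind {
    UTF8_ASCII,   /* 0xxxxxxx */
    UTF8_TRAIL,   /* 10xxxxxx */
    UTF8_LEAD2,   /* 110xxxxx */
    UTF8_LEAD3,   /* 1110xxxx */
    UTF8_LEAD4,   /* 11110xxx */
    UTF8_INVALID  /* 11111xxx, matches none of the patterns above */
};

/* Classify a single byte by its leading bits. */
enum utf8_byte_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return UTF8_ASCII;
    if ((b & 0xC0) == 0x80) return UTF8_TRAIL;
    if ((b & 0xE0) == 0xC0) return UTF8_LEAD2;
    if ((b & 0xF0) == 0xE0) return UTF8_LEAD3;
    if ((b & 0xF8) == 0xF0) return UTF8_LEAD4;
    return UTF8_INVALID;
}

/* Count characters (code points) by counting every byte that is not a
 * trail byte. Assumes buf[0..len) holds well-formed UTF-8. */
size_t utf8_count_code_points(const char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (utf8_classify((unsigned char)buf[i]) != UTF8_TRAIL) {
            count++;
        }
    }
    return count;
}
```

Counting non-trail bytes is one use of the lead/trail distinction: each character contributes exactly one ASCII or lead byte, so skipping the `10xxxxxx` bytes gives the character count without decoding anything.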

dan04 answered Sep 24 '22 11:09