
How does GB18030 differ from Unicode?

Tags: unicode

How does the Chinese GB18030 code set differ from Unicode?

What special techniques are required for handling GB18030?

Are there any (open source) libraries for handling GB18030?

Jonathan Leffler asked Oct 21 '08 20:10



2 Answers

As per the Wikipedia article on GB18030, "GB18030 can be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set." That is, every Unicode character can be encoded in GB18030, but the resulting byte sequences differ from those produced by UTF-8 or UTF-16. Handling GB18030 doesn't require any more special techniques than are required for any other non-Unicode encoding.
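To make the "different byte sequences" point concrete, here is a small Python sketch. Python is my choice for illustration, not something the question asked about; it relies on the `gb18030` codec that ships with CPython, and the sample string is arbitrary.

```python
# Compare how the same text is serialized by GB18030, UTF-8 and UTF-16.
# The sample string is an arbitrary choice for illustration.
sample = "Aß中"

as_gb18030 = sample.encode("gb18030")
as_utf8 = sample.encode("utf-8")
as_utf16 = sample.encode("utf-16-be")

# Print each serialization as space-separated hex bytes (needs Python 3.8+).
for name, data in [("GB18030", as_gb18030), ("UTF-8", as_utf8), ("UTF-16BE", as_utf16)]:
    print(f"{name:9}", data.hex(" "))

# Code points outside the BMP round-trip as well.
emoji = "\U0001F600"
assert emoji.encode("gb18030").decode("gb18030") == emoji
```

ASCII stays one byte in GB18030, just as in UTF-8, while the other two characters come out as entirely different bytes in each encoding.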

The ICU project is an open source library (for C/C++ or Java) that has full support for many different encodings, including GB18030. Information on converting between different encodings with ICU can be found in the ICU documentation.
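If you only want to experiment with conversion before pulling in ICU, the idea can be sketched with Python's standard `codecs` module. To be clear, this is a stand-in and not ICU's API; `gb18030_to_utf8` is a hypothetical helper name. It assumes the input arrives in arbitrary chunks that may split a multi-byte character, which is why an incremental decoder is used.

```python
import codecs
from typing import Iterable, Iterator


def gb18030_to_utf8(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Transcode a stream of GB18030 byte chunks to UTF-8 (hypothetical helper).

    The incremental decoder buffers any trailing partial character, so a
    chunk boundary in the middle of a 2- or 4-byte sequence is harmless.
    """
    decoder = codecs.getincrementaldecoder("gb18030")(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text.encode("utf-8")
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode("utf-8")


# Usage: split the encoded data at awkward offsets on purpose.
raw = "aß中".encode("gb18030")
pieces = [raw[:3], raw[3:6], raw[6:]]
converted = b"".join(gb18030_to_utf8(pieces))
```

The same chunk-boundary concern applies whichever library does the conversion, so check that the converter you use keeps state between calls.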

Bradley Grainger answered Sep 23 '22 15:09



What special techniques are required for handling GB18030?

The biggest thing to be aware of is that, unlike UTF-8, GB18030 allows ASCII bytes to occur within the encoding of a multi-byte character. (For example, 'ß' is encoded as the hex bytes 81 30 89 38, which contain the ASCII encodings of '0' and '8'.) In four-byte sequences the second and fourth bytes always fall in the range 0x30 to 0x39, i.e. the ASCII digits. This means that you can't use a simple byte-oriented find/index function without risking false matches.
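A quick way to see the problem, sketched in Python with the built-in `gb18030` codec (`find_char` is a hypothetical helper, and decoding the whole buffer first is just the simplest safe approach, not the only one):

```python
data = "ß".encode("gb18030")  # hex bytes 81 30 89 38

# Naive byte search: reports a '0' that is really the second byte of 'ß'.
false_hit = data.find(b"0")


def find_char(encoded: bytes, needle: str) -> int:
    """Return the character index of needle in GB18030 data, or -1 if absent.

    Decoding first means the search works on characters, not raw bytes.
    """
    return encoded.decode("gb18030").find(needle)


# Character-level search: correctly reports that there is no '0' in "ß".
true_hit = find_char(data, "0")

print(false_hit, true_hit)
```

The byte search returns a non-negative offset even though the text contains no '0', while the decoded search returns -1. Note that `find_char` returns a character index rather than a byte offset.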

dan04 answered Sep 23 '22 15:09
