How does GB18030 differ from Unicode?

Question

How does the Chinese GB18030 code set differ from Unicode?

What special techniques are required for handling GB18030?

Are there any (open source) libraries for handling GB18030?

Bradley Grainger · Accepted Answer

As per the Wikipedia article on GB18030, "GB18030 can be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set." That is, every Unicode character can be encoded in GB18030, but the resulting byte sequences differ from those produced by UTF-8 or UTF-16. Handling GB18030 doesn't require any more special techniques than are required for any other non-Unicode encoding.
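To make the "same characters, different bytes" point concrete, here is a minimal sketch using Python's built-in `gb18030` codec. The sample characters and the helper name `encodings_of` are illustrative choices, not anything from the GB18030 standard.

```python
# Compare the bytes produced for the same character under three encodings.
# The sample characters below are arbitrary illustrations.
SAMPLES = ["A", "ß", "中", "😀"]
CODECS = ("gb18030", "utf-8", "utf-16-be")


def encodings_of(ch: str) -> dict:
    """Return {codec name: encoded bytes} for a single character."""
    return {name: ch.encode(name) for name in CODECS}


for ch in SAMPLES:
    row = {name: data.hex() for name, data in encodings_of(ch).items()}
    print(f"U+{ord(ch):04X}", row)

# Every sample survives a round trip, which is what makes GB18030
# usable as a full encoding of Unicode text.
text = "".join(SAMPLES)
assert text.encode("gb18030").decode("gb18030") == text
```

ASCII characters come out as the same single byte in GB18030 and UTF-8; the other characters take one, two, or four bytes in GB18030 depending on where they sit relative to the legacy GB2312/GBK ranges.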

The ICU project is an open source library (for C/C++ and Java) that has full support for many different encodings, including GB18030. Information on converting between different encodings with ICU can be found in the conversion chapter of the ICU User Guide.

dan04 · Answer

What special techniques are required for handling GB18030?

The biggest thing to be aware of is that, unlike UTF-8, GB18030 allows ASCII bytes to occur within the encoding of a multi-byte character. (For example, 'ß' is encoded as the bytes 81 30 89 38, which contain the ASCII encodings of '0' and '8'.) This means that you can't use a simple byte-oriented find/index function to search for ASCII characters, because a match may land in the middle of a multi-byte character.
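The pitfall can be reproduced with Python's built-in `gb18030` codec. The sketch below is one possible workaround, not the only one: it decodes to a string first, searches there, and maps the result back to a byte offset. The function name `find_text` is hypothetical, and the offset calculation assumes the input is valid GB18030 that re-encodes to the same bytes.

```python
def find_text(data: bytes, needle: str) -> int:
    """Return the byte offset of `needle` in GB18030-encoded `data`, or -1.

    Searching happens on decoded text, so a match can never start in the
    middle of a multi-byte character.
    """
    text = data.decode("gb18030")
    index = text.find(needle)
    if index < 0:
        return -1
    # Re-encode the prefix to translate the character index to a byte offset.
    return len(text[:index].encode("gb18030"))


encoded = "ß".encode("gb18030")        # bytes 81 30 89 38

# Naive byte search: reports a '0' that is not really in the text.
naive_hit = b"0" in encoded

# Character-aware search: correctly reports that '0' is absent.
aware_hit = find_text(encoded, "0")

print(naive_hit, aware_hit)
```

Decoding the whole buffer costs memory proportional to its size; for large inputs a streaming decoder (for example `codecs.getincrementaldecoder("gb18030")`) would be the usual alternative.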
