I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance, "的" is encoded as "%E7%9A%84". That's three bytes; however, none of the encodings described on this page uses three bytes to represent Chinese characters. UTF-8, for instance, uses 2 bytes.
I'm basically trying to match these three bytes to an actual character. Any suggestions on what encoding it could be?
UTF-8 is a character encoding. It is backward-compatible with ASCII (plain ASCII text is valid UTF-8) while still allowing for international characters, such as Chinese characters, which it encodes as multi-byte sequences. As of the mid 2020s, UTF-8 is by far the most common encoding on the web.
It's not that UTF-8 doesn't cover Chinese characters and UTF-16 does. UTF-16 encodes each character as one or two 16-bit code units (2 or 4 bytes), while UTF-8 uses 1, 2, 3, or at most 4 bytes depending on the character, so an ASCII character is still represented as a single byte. Code points from U+0800 through U+FFFF, a range that covers the common Chinese characters, take exactly 3 bytes in UTF-8. "的" is U+7684, which UTF-8 encodes as the bytes E7 9A 84; percent-encode those and you get the %E7%9A%84 seen in the URL.
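A quick way to see this, as a minimal sketch in a Python 3 shell (the sample characters are arbitrary), is to count the bytes each encoding produces per character:
>>> for ch in 'A', 'é', '的':
...     # bytes per character in UTF-8 vs. UTF-16 (big-endian, no BOM)
...     print(ch, len(ch.encode('utf-8')), len(ch.encode('utf-16-be')))
...
A 1 2
é 2 2
的 3 2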
What character encoding does Wikipedia use? Since MediaWiki 1.5, all projects use the Unicode (UTF-8) character encoding.
The basic block, CJK Unified Ideographs (U+4E00–U+9FFF), contains 20,992 characters. It includes not only characters used in the Chinese writing system but also the kanji used in the Japanese writing system and the hanja whose use is diminishing in Korea.
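You can verify that "的" (U+7684) falls inside this block from a Python 3 shell, for instance:
>>> hex(ord('的'))  # code point of 的
'0x7684'
>>> 0x4E00 <= ord('的') <= 0x9FFF  # inside CJK Unified Ideographs?
True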
The header of a Wikipedia page includes this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
So the page is UTF-8.
>>> c = '\xe7\x9a\x84'.decode('utf8')
>>> c
u'\u7684'
>>> print c
的
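That transcript is Python 2. As a minimal Python 3 sketch of the same round trip, you can decode the raw bytes directly, or undo the URL's percent-encoding with the standard-library function urllib.parse.unquote:
>>> from urllib.parse import unquote
>>> # decode the raw UTF-8 bytes by hand
>>> b'\xe7\x9a\x84'.decode('utf-8')
'的'
>>> # or let unquote undo the percent-encoding (it assumes UTF-8 by default)
>>> unquote('%E7%9A%84')
'的'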