 

What is the encoding of Chinese characters on Wikipedia?

I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance, "的" is encoded as "%E7%9A%84" (see here). That's three bytes; however, none of the encodings described on this page seems to use three bytes to represent Chinese characters. UTF-8, for instance, uses 2 bytes.

I'm basically trying to match these three bytes to an actual character. Any suggestion on what encoding it could be?

laurent asked Apr 10 '11 05:04


People also ask

What encoding do Chinese characters use?

UTF-8 is a variable-length character encoding for Unicode. It is backward compatible with ASCII, so plain ASCII text is unchanged, while still allowing for international characters, such as Chinese characters. As of the mid 2020s, UTF-8 is one of the most popular encoding systems.

Are Chinese characters UTF-8 or UTF 16?

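It's not that UTF-8 doesn't cover Chinese characters and UTF-16 does; both can represent every Unicode character. UTF-16 uses one or two 16-bit code units (2 or 4 bytes) per character, while UTF-8 uses 1, 2, 3, or at most 4 bytes, depending on the character, so that an ASCII character is still represented as 1 byte. Most common Chinese characters take 3 bytes in UTF-8 and 2 bytes in UTF-16.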
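As a quick check of those byte counts, here is a minimal Python 3 sketch. The helper name `encoded_lengths` and the sample characters are illustrative choices, not something from the original page.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# Sample characters: ASCII, Latin-1 range, CJK, and one outside the BMP.
samples = {
    "A": "\u0041",
    "e-acute": "\u00e9",
    "de": "\u7684",            # 的
    "emoji": "\U0001F600",
}

for name, ch in samples.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"{name}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```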

What encoding does Wikipedia use?

From MediaWiki 1.5 onward, all projects use the Unicode (UTF-8) character encoding.

What is the Unicode range for Chinese characters?

The basic block, named CJK Unified Ideographs, spans U+4E00 through U+9FFF and contains 20,992 basic Chinese characters. The block includes not only characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea.
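A membership test for that block can be sketched in Python 3 as follows. The function name `is_cjk_unified_basic` is hypothetical, and the check covers only the basic block, not the CJK extension blocks.

```python
CJK_BASIC_START = 0x4E00
CJK_BASIC_END = 0x9FFF

def is_cjk_unified_basic(ch):
    """True if the single character ch lies in U+4E00..U+9FFF."""
    return CJK_BASIC_START <= ord(ch) <= CJK_BASIC_END

# Size of the range, both ends inclusive.
block_size = CJK_BASIC_END - CJK_BASIC_START + 1

print(is_cjk_unified_basic("\u7684"))  # 的 is U+7684
print(block_size)
```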


2 Answers

The header of a Wikipedia page includes this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 

So the page is UTF-8.
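The "%E7%9A%84" form in the question is URL percent-encoding of those UTF-8 bytes. A minimal Python 3 sketch of decoding it with the standard library (the variable names are illustrative):

```python
from urllib.parse import unquote, unquote_to_bytes

encoded = "%E7%9A%84"

# Step 1: undo the percent-encoding to get the raw bytes.
raw = unquote_to_bytes(encoded)

# Step 2: interpret those bytes as UTF-8 to get the character.
char = raw.decode("utf-8")

# unquote() does both steps at once, assuming UTF-8 by default.
same_char = unquote(encoded, encoding="utf-8")

print(raw, len(raw), char, same_char)
```

If the bytes were not valid UTF-8, `raw.decode("utf-8")` would raise `UnicodeDecodeError`, which is a reasonable sanity check on the guessed encoding.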

Adam answered Sep 23 '22 23:09


>>> c = '\xe7\x9a\x84'.decode('utf8')
>>> c
u'\u7684'
>>> print c
的


Although the Unicode code point for this character (U+7684) fits in 16 bits, UTF-8 encodes it as 3 bytes.
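To see where the three bytes come from, here is a sketch in Python 3 that applies the UTF-8 three-byte layout (1110xxxx 10xxxxxx 10xxxxxx) by hand. The function `utf8_encode_3byte` is a hypothetical helper for illustration; in practice `str.encode("utf-8")` does this for you.

```python
def utf8_encode_3byte(code_point):
    """Encode a code point in U+0800..U+FFFF using the 3-byte UTF-8 layout."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point does not use the 3-byte form")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")
    b1 = 0xE0 | (code_point >> 12)           # 1110xxxx: top 4 bits
    b2 = 0x80 | ((code_point >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    b3 = 0x80 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([b1, b2, b3])

manual = utf8_encode_3byte(0x7684)
builtin = "\u7684".encode("utf-8")
print(manual.hex(), builtin.hex())
```

The 16 bits of U+7684 (0111 0110 1000 0100) are split 4 + 6 + 6 across the three bytes, which is why a 16-bit code point needs 24 bits in UTF-8.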
jcomeau_ictx answered Sep 23 '22 23:09