
What are the upper and lower bounds for Chinese characters in UTF-8?

Tags: python, cjk

I would like to build a set in Python that contains the ord() value of every Chinese character.

For English, the equivalent is:

english = (set(range(ord('a'), ord('z') + 1)) |
           set(range(ord('A'), ord('Z') + 1)))
0x90 asked Dec 16 '22 04:12


1 Answer

From the Unicode Standard (v6.0, section 12.1),

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2

Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

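There are also a few small additions to the original repertoire. Note that these code points lie inside the CJK Unified Ideographs block (4E00–9FFF); they were assigned in later Unicode versions, so a range covering the whole block already includes them: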

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

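To construct a set of the ordinal values of these, chain the ranges together (range objects cannot be concatenated with + in Python 3):

from itertools import chain

chinese = set(chain(
    range(0x4E00, 0xA000),     # CJK Unified Ideographs, incl. 9FA6 to 9FCB
    range(0x3400, 0x4DC0),     # Extension A
    range(0x20000, 0x2A6E0),   # Extension B
    range(0x2A700, 0x2B740),   # Extension C
    range(0x2B740, 0x2B820),   # Extension D
    range(0xF900, 0xFB00),     # CJK Compatibility Ideographs
    range(0x2F800, 0x2FA20),   # CJK Compatibility Ideographs Supplement
))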

Be aware, though, that this set contains 75,744 code points (not all of them assigned characters), so it may not be the most compact or efficient data structure for this.
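As a more compact alternative, here is a minimal sketch that stores only the block boundaries from Table 12-2 and tests a code point against them. It assumes Python 3; the names HAN_RANGES, is_han and han_only are made up for this illustration and do not come from any library.

```python
# Inclusive (first, last) code point pairs copied from Table 12-2.
# HAN_RANGES, is_han and han_only are hypothetical names for this sketch.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch lies in one of HAN_RANGES."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_RANGES)


def han_only(text):
    """Keep only the characters of text that is_han() accepts."""
    return "".join(c for c in text if is_han(c))
```

With only seven ranges, a linear scan per character is cheap; the full set is still a reasonable choice if you need set operations such as intersection with another set of ordinals.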

Also, if you insist on using ord() on literal characters, code points above U+FFFF need the \U escape, which takes exactly eight hex digits (the u prefix is optional in Python 3):

>>> ord(u'\U0002F800')
194560
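Since the question mentions UTF-8, it may help to see that ord() returns a Unicode code point, not UTF-8 bytes, and that the ranges above are code point ranges. The following is a small sketch assuming Python 3; the variable names are chosen only for this example.

```python
# \u takes four hex digits, \U takes eight; both produce one character in Python 3.
han = "\u4e2d"        # U+4E2D, in the main CJK Unified Ideographs block
ext_b = "\U00020000"  # U+20000, first code point of Extension B

# ord() gives the code point regardless of how the text is later encoded.
code_points = [ord(han), ord(ext_b)]

# The UTF-8 encoding of the same characters has a different length per character.
utf8_lengths = [len(han.encode("utf-8")), len(ext_b.encode("utf-8"))]

print([hex(cp) for cp in code_points])  # ['0x4e2d', '0x20000']
print(utf8_lengths)                     # [3, 4]
```

chr() is the inverse of ord(), so chr(0x2F800) gives back the same character as the '\U0002F800' literal.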
Ian Clelland answered Feb 01 '23 23:02