I would like to make a set in Python that contains the ord() values
of all Chinese characters.
For English, the equivalent is:
# set().union(...) runs on Python 2 and 3; range() + range() fails on Python 3
english = set().union(range(ord('a'), ord('z') + 1),
                      range(ord('A'), ord('Z') + 1))
From the Unicode Standard (v6.0, section 12.1):
"Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2."
Table 12-2. Blocks Containing Han Ideographs
Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants
There are also a few later additions. These extend the original range (4E00–9FA5) but still fall inside the 4E00–9FFF block listed above:
Table 12-3. Small Extensions to the URO
Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard
To use set operations to construct a set of the ordinal values of these, you can do this:
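# set().union(...) runs on Python 2 and 3; range() + range() fails on Python 3
chinese = set().union(range(0x4E00, 0xA000),     # CJK Unified Ideographs
                      range(0x3400, 0x4DC0),     # Extension A
                      range(0x20000, 0x2A6E0),   # Extension B
                      range(0x2A700, 0x2B740),   # Extension C
                      range(0x2B740, 0x2B820),   # Extension D
                      range(0xF900, 0xFB00),     # Compatibility Ideographs
                      range(0x2F800, 0x2FA20),   # Compatibility Ideographs Supplement
                      range(0x9FA6, 0x9FCC))     # Table 12-3 (already inside 4E00-9FFF)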
Be aware, though, that this set contains over 75,000 code points, so it may not be the most compact or efficient data structure for this.
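If the size of the set is a concern, one possible alternative is to keep only the block boundaries and test membership with a binary search. This is just a sketch: the names HAN_RANGES and is_han are made up for illustration, and the ranges are the seven blocks from Table 12-2.

```python
from bisect import bisect_right

# Inclusive (start, end) pairs from Table 12-2, sorted by start.
# HAN_RANGES and is_han are illustrative names, not part of any library.
HAN_RANGES = [
    (0x3400, 0x4DBF),    # Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2F800, 0x2FA1F),  # Compatibility Ideographs Supplement
]
_STARTS = [start for start, _ in HAN_RANGES]

def is_han(codepoint):
    """Return True if codepoint falls inside one of the Han blocks."""
    # Find the last range whose start is <= codepoint, then check its end.
    i = bisect_right(_STARTS, codepoint) - 1
    return i >= 0 and codepoint <= HAN_RANGES[i][1]
```

For example, is_han(0x4E2D) is True and is_han(ord('a')) is False. This stores seven pairs instead of tens of thousands of integers, at the cost of a slightly slower lookup than a set.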
Also, if you insist on using ord() on literal characters, any character above U+FFFF needs the 8-digit \U escape (exactly eight hex digits) rather than the 4-digit \u form:
>>> ord(u'\U0002F800')
194560
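To make the difference between the two escape forms concrete, here is a small sketch. It assumes Python 3.3+ or a "wide" Python 2 build; on a narrow Python 2 build the second character is stored as a surrogate pair, so ord() on it raises a TypeError.

```python
# \u takes exactly 4 hex digits, \U takes exactly 8.
bmp = u'\u4e2d'         # U+4E2D, fits in the 4-digit form
astral = u'\U0002F800'  # U+2F800, needs the 8-digit form

print(hex(ord(bmp)))     # 0x4e2d
print(hex(ord(astral)))  # 0x2f800
```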