Where can I find a Unicode table showing only the simplified Chinese characters? I have searched everywhere but cannot find anything.
UPDATE :
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2 :
This site also provides a GB/Unicode table and even a Java program to generate a file
with all the GB characters as well as the Unicode equivalents :
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt
. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
Here is a regex of all simplified Chinese characters I made. For some reason Stackoverflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, but also that these are utf-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars from ['一-龥']
and try to build a new list. Or run two regex's, one to check it is Chinese, but is not simplified Chinese
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With