
Obtain a list of the 143,859 characters in the Unicode Standard, Version 13.0.0

Is it possible to obtain a list of all 143,859 characters included in version 13.0.0 of Unicode? I'm trying to print these 143,859 characters in Python, but was unable to find a comprehensive list of all the characters.

asked Dec 18 '25 by nimish

2 Answers

To obtain a list of 143,859 characters, you must exclude the same categories the Unicode Consortium excluded in order to arrive at that count.

import sys
from unicodedata import category, unidata_version

# General categories excluded from the official character count:
# Cc (control), Co (private use), Cs (surrogate), Cn (unassigned or noncharacter)
excluded = {"Cc", "Co", "Cs", "Cn"}

chars = []
for i in range(sys.maxunicode + 1):
    c = chr(i)
    if category(c) not in excluded:
        chars.append(c)

print(f"Number of characters in Unicode v{unidata_version}: {len(chars):,}")

Output on my machine:

Number of characters in Unicode v13.0.0: 143,859
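As a sanity check (a sketch of mine, not part of the original answer), you can tally all code points by general category with `collections.Counter` and see how much each excluded category contributes; the exact totals depend on the Unicode version your Python build ships with:

```python
import sys
from collections import Counter
from unicodedata import category

# Count every code point in the full range by its general category.
counts = Counter(category(chr(i)) for i in range(sys.maxunicode + 1))

excluded = {"Cc", "Co", "Cs", "Cn"}
assigned = sum(n for cat, n in counts.items() if cat not in excluded)

# The surrogate and control blocks are fixed by the standard:
print(counts["Cs"])  # 2048 (U+D800..U+DFFF)
print(counts["Cc"])  # 65 (U+0000..U+001F, U+007F..U+009F)
print(assigned)      # the version-dependent character count
```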
answered Dec 20 '25 by wim

I think your best bet is probably to read the UnicodeData.txt file, as recommended by @wim in a comment below, then expand all the ranges that are marked off by <..., First> and <..., Last> in the second column, e.g., expand

3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

to

3400
3401
3402
...
4DBD
4DBE
4DBF

I haven't checked, but I'm guessing this would give you a pretty complete list.
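The expansion step can be sketched as follows. `SAMPLE` and `expand` are my own names; `SAMPLE` stands in for a few lines of the real UnicodeData.txt (which you would download from unicode.org), and the parsing assumes well-formed First/Last pairs:

```python
# A stand-in for a few lines of UnicodeData.txt: one ordinary entry
# plus a <..., First>/<..., Last> range pair.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
"""

def expand(lines):
    """Expand First/Last range markers into individual code points."""
    codepoints = []
    first = None
    for line in lines:
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            first = cp          # remember the start of the range
        elif name.endswith(", Last>"):
            codepoints.extend(range(first, cp + 1))
            first = None
        else:
            codepoints.append(cp)
    return codepoints

cps = expand(SAMPLE.splitlines())
print(len(cps))  # 6593: 'A' plus the 6592 code points 3400..4DBF
```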

Below are some other suggestions I made earlier, some of which could be useful.

Other Ideas

You can make a start with the Unicode Character Name Index, which is linked from the list of Unicode 13.0 Character Code Charts. However, that table has significant gaps and repetitions, e.g., all Latin capital letters are lumped under 0041 (A) and the group is identified in a few different ways. Actually, the table is pretty incomplete -- it only has 2,759 unique codes.

Keying off of @wim's comment on the original post, another option might be to take a look at the source code for Python's unicodedata module. unicodename_db.h has some lists of codes that are read by _getucname in unicodedata.c. It looks like phrasebook may have a nearly complete list of codes (188,803 items), but possibly munged in some way (I don't have time to figure out the lookup/offset mechanism right now). In addition to those, Hangul syllables and unified ideographs are processed as ranges, not looked up from the phrasebook.

answered Dec 20 '25 by Matthias Fripp


