Is it possible to obtain a list of all 143,859 characters included in version 13.0.0 of Unicode? I'm trying to print these 143,859 characters in Python but have been unable to find a comprehensive list of them.
To obtain a list of 143,859 characters, you must exclude the same categories the Unicode Consortium excludes when it arrives at that count.
import sys
from unicodedata import category, unidata_version
chars = []
for i in range(sys.maxunicode + 1):
    c = chr(i)
    cat = category(c)
    if cat == "Cc":  # control characters
        continue
    if cat == "Co":  # private use
        continue
    if cat == "Cs":  # surrogates
        continue
    if cat == "Cn":  # noncharacter or reserved
        continue
    chars.append(c)
print(f"Number of characters in Unicode v{unidata_version}: {len(chars):,}")
Output on my machine:
Number of characters in Unicode v13.0.0: 143,859
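To see where the rest of the code space goes, here is a rough sketch that tallies every code point by general category. It uses the same four exclusions as the loop above; the names `counts`, `excluded` and `kept` are just mine for illustration, and the "kept" figure will depend on which Unicode version your Python build ships with.

```python
import sys
from collections import Counter
from unicodedata import category, unidata_version

# Tally the general category of every code point from U+0000 to U+10FFFF.
counts = Counter(category(chr(i)) for i in range(sys.maxunicode + 1))

# The same four categories the loop above skips.
excluded = ("Cc", "Co", "Cs", "Cn")
kept = sum(n for cat, n in counts.items() if cat not in excluded)

for cat in excluded:
    print(f"{cat}: {counts[cat]:,}")
print(f"kept (Unicode v{unidata_version}): {kept:,}")
```

The Cc, Co and Cs totals should be identical on any version; only the split between Cn and the kept characters moves as new characters are assigned.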
I think your best bet is probably to read the UnicodeData.txt file as recommended by @wim in a comment below, then expand all the ranges that are marked off by <..., First> and <..., Last> in the second column, e.g., expand
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
to
3400
3401
3402
...
4DBD
4DBE
4DBF
I haven't checked, but I'm guessing this would give you a pretty complete list.
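A minimal sketch of that expansion, assuming the semicolon-separated layout shown above (code point in the first field, name in the second, general category in the third). The function name `expand_unicode_data` and the `SAMPLE` text are made up for this example; for the real thing you would pass in the lines of a downloaded UnicodeData.txt instead.

```python
def expand_unicode_data(lines):
    """Yield (code_point, category) pairs, expanding <..., First>/<..., Last> ranges."""
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name, cat = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where the range begins; the matching Last line follows.
            range_start = code
        elif name.endswith(", Last>"):
            if range_start is None:
                raise ValueError(f"Last without First at {fields[0]}")
            for cp in range(range_start, code + 1):
                yield cp, cat
            range_start = None
        else:
            yield code, cat


# Hypothetical three-line excerpt in the UnicodeData.txt format.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
"""

entries = list(expand_unicode_data(SAMPLE.splitlines()))
print(len(entries))  # 1 single entry + 6,592 expanded ideographs
```

Note that the file also has entries for control characters, surrogates and private-use ranges, so to match the 143,859 figure you would still need to drop the pairs whose category is Cc, Cs or Co.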
Below are some other suggestions I made earlier, some of which could be useful.
Other Ideas
You can make a start with the Unicode Character Name Index, which is linked from the list of Unicode 13.0 Character Code Charts. However, that table has significant gaps and repetitions, e.g., all Latin capital letters are lumped under 0041 (A) and the group is identified a few different ways. Actually, the table is pretty incomplete -- it only has 2,759 unique codes.
Keying off of @wim's comment on the original post, another option might be to take a look at the source code for Python's unicodedata module. unicodename_db.h has some lists of codes that are read by _getucname in unicodedata.c. It looks like phrasebook may have a nearly complete list of codes (188,803 items), but possibly munged in some way (I don't have time to figure out the lookup/offset mechanism right now). In addition to those, Hangul syllables and unified ideographs are processed as ranges, not looked up from the phrasebook.
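You can see the effect of that range handling from Python without reading the C source. This small sketch only uses `unicodedata.name`; the two code points are ones I picked as examples (the first Hangul syllable and the first CJK unified ideograph), and their names follow a fixed pattern rather than being stored one by one.

```python
from unicodedata import name

# First precomposed Hangul syllable: the name is built from its jamo parts.
hangul = name(chr(0xAC00))
# First CJK unified ideograph: the name is the prefix plus the hex code point.
ideograph = name(chr(0x4E00))

print(hangul)     # HANGUL SYLLABLE GA
print(ideograph)  # CJK UNIFIED IDEOGRAPH-4E00
```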