
Obtain a list of the 143,859 characters in the Unicode Standard, Version 13.0.0

Is it possible to obtain a list of all 143,859 characters included in version 13.0.0 of Unicode? I'm trying to print these 143,859 characters in Python, but was unable to find a comprehensive list of all the characters.

asked Dec 18 '25 by nimish

2 Answers

To obtain a list of 143,859 characters, you must exclude the same categories the Unicode Consortium excluded in order to arrive at that count.

import sys
from unicodedata import category, unidata_version

# General categories excluded from the official character count:
# Cc (control), Co (private use), Cs (surrogate), Cn (unassigned or noncharacter)
excluded = {"Cc", "Co", "Cs", "Cn"}

chars = []
for i in range(sys.maxunicode + 1):
    c = chr(i)
    if category(c) not in excluded:
        chars.append(c)

print(f"Number of characters in Unicode v{unidata_version}: {len(chars):,}")

Output on my machine:

Number of characters in Unicode v13.0.0: 143,859
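As a sanity check (a sketch of mine, not part of the original answer), you can tally all code points by general category with `collections.Counter` and see how much each excluded category contributes; the exact totals depend on the Unicode version your Python build ships with:

```python
import sys
from collections import Counter
from unicodedata import category

# Count every code point in the full range by its general category.
counts = Counter(category(chr(i)) for i in range(sys.maxunicode + 1))

excluded = {"Cc", "Co", "Cs", "Cn"}
assigned = sum(n for cat, n in counts.items() if cat not in excluded)

# The surrogate and control blocks are fixed by the standard:
print(counts["Cs"])  # 2048 (U+D800..U+DFFF)
print(counts["Cc"])  # 65 (U+0000..U+001F, U+007F..U+009F)
print(assigned)      # the version-dependent character count
```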
answered Dec 20 '25 by wim

I think your best bet is probably to read the UnicodeData.txt file, as recommended by @wim in a comment below, then expand all the ranges that are marked off by <..., First> and <..., Last> in the second column, e.g., expand

3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

to

3400
3401
3402
...
4DBD
4DBE
4DBF

I haven't checked, but I'm guessing this would give you a pretty complete list.
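The expansion step can be sketched as follows. `SAMPLE` and `expand` are my own names; `SAMPLE` stands in for a few lines of the real UnicodeData.txt (which you would download from unicode.org), and the parsing assumes well-formed First/Last pairs:

```python
# A stand-in for a few lines of UnicodeData.txt: one ordinary entry
# plus a <..., First>/<..., Last> range pair.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
"""

def expand(lines):
    """Expand First/Last range markers into individual code points."""
    codepoints = []
    first = None
    for line in lines:
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            first = cp          # remember the start of the range
        elif name.endswith(", Last>"):
            codepoints.extend(range(first, cp + 1))
            first = None
        else:
            codepoints.append(cp)
    return codepoints

cps = expand(SAMPLE.splitlines())
print(len(cps))  # 6593: 'A' plus the 6592 code points 3400..4DBF
```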

Below are some other suggestions I made earlier, some of which could be useful.

Other Ideas

You can make a start with the Unicode Character Name Index, which is linked from the list of Unicode 13.0 Character Code Charts. However, that table has significant gaps and repetitions, e.g., all Latin capital letters are lumped under 0041 (A) and the group is identified in a few different ways. Actually, the table is pretty incomplete -- it only has 2,759 unique codes.

Keying off of @wim's comment on the original post, another option might be to take a look at the source code for Python's unicodedata module. unicodename_db.h has some lists of codes that are read by _getucname in unicodedata.c. It looks like phrasebook may have a nearly complete list of codes (188,803 items), but possibly munged in some way (I don't have time to figure out the lookup/offset mechanism right now). In addition to those, Hangul syllables and unified ideographs are processed as ranges, not looked up from the phrasebook.

answered Dec 20 '25 by Matthias Fripp


