
Efficiently list all characters in a given Unicode category

Often one wants to list all characters in a given Unicode category. For example:

  • List all Unicode whitespace (see "How can I get all whitespaces in UTF-8 in Python?")
  • Characters with the property Alphabetic

It is possible to produce this list by iterating over all Unicode code points and testing for the desired category (Python 3):

import unicodedata
[c for c in map(chr, range(0x110000)) if unicodedata.category(c) in ('Ll',)]

or using regexes,

import re
re.findall(r'\s', ''.join(map(chr, range(0x110000))))

But these methods are slow. Is there a way to look up a list of characters in the category without having to iterate over all of them?

Related question for Perl: How do I get a list of all Unicode characters that have a given property?

Mechanical snail asked Jan 09 '13 20:01



1 Answer

If you need to do this often, it's easy enough to build yourself a reusable map:

import sys
import unicodedata
from collections import defaultdict

unicode_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)

And from there on out use that map to translate back to a series of characters for a given category:

lowercase_letters = unicode_category['Ll']

If this is too costly for start-up time, consider dumping that structure to a file; loading this mapping from a JSON file or other quick-to-parse-to-dict format should not be too painful.
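The caching idea above could look roughly like the following. This is a minimal sketch, not a tuned implementation: the function names and the file layout are made up for illustration, code points are stored as integers so the file stays plain ASCII, and the Unicode database version is recorded so a stale file is rebuilt after a Python upgrade. Whether loading actually beats rebuilding depends on your machine, so measure before relying on it.

```python
import json
import sys
import unicodedata
from collections import defaultdict
from pathlib import Path


def build_category_map():
    """Group every code point by its general category (the slow, one-off step)."""
    categories = defaultdict(list)
    for code_point in range(sys.maxunicode + 1):
        categories[unicodedata.category(chr(code_point))].append(code_point)
    return dict(categories)


def save_category_map(path):
    """Write the map as JSON, tagged with the Unicode version it was built from."""
    payload = {
        "unidata_version": unicodedata.unidata_version,
        "categories": build_category_map(),
    }
    Path(path).write_text(json.dumps(payload), encoding="ascii")


def load_category_map(path):
    """Load the cached map, rebuilding the file if it is missing or stale."""
    path = Path(path)
    if path.exists():
        payload = json.loads(path.read_text(encoding="ascii"))
        if payload.get("unidata_version") == unicodedata.unidata_version:
            # Convert the stored integers back into one-character strings.
            return {
                category: [chr(cp) for cp in code_points]
                for category, code_points in payload["categories"].items()
            }
    save_category_map(path)
    return load_category_map(path)
```

Calling load_category_map("unicode_categories.json") (a hypothetical file name) then gives back the same kind of dict as unicode_category above, keyed by category code.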

Once you have the mapping, looking up a category is a single dictionary access, which takes constant time on average.
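The whitespace example from the question is not a single general category, so the category map alone does not cover it. One way to get the same pay-once behaviour for an arbitrary test is to cache the scan per predicate. The sketch below assumes str.isspace is an acceptable definition of whitespace; chars_matching is a hypothetical helper name, not something from the standard library.

```python
import sys
from functools import lru_cache


@lru_cache(maxsize=None)
def chars_matching(predicate):
    """Return all characters for which predicate(char) is true.

    The full scan runs once per distinct predicate; repeat calls
    return the cached tuple.
    """
    return tuple(
        c for c in map(chr, range(sys.maxunicode + 1)) if predicate(c)
    )


whitespace = chars_matching(str.isspace)        # first call scans every code point
whitespace_again = chars_matching(str.isspace)  # later calls hit the cache
```

The predicate has to be hashable for lru_cache to work, which holds for built-in methods such as str.isspace and for ordinary named functions.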

Martijn Pieters answered Sep 22 '22 11:09