What is the proper way to identify Unicode private use characters in Python 3? There's nothing obviously relevant in the module unicodedata, which makes it easy to look up character names and attributes.
Some background: unicodedata.name(), which gives the name of Unicode characters, will raise a ValueError if called with a private use character (e.g., try unicodedata.name("\uf026")). But whitespace characters (except for space itself), and possibly other things, also trigger an exception. So what's a non-hacky, reliable way to detect PUA characters?
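The ambiguity described in the question can be reproduced with a short sketch. The helper has_name and the sample characters are illustrative choices, not part of unicodedata; the point is only that a ValueError from unicodedata.name() does not distinguish a private use character from a control character such as tab.

```python
import unicodedata

def has_name(ch):
    """Return True if unicodedata.name() can name ch (hypothetical helper)."""
    try:
        unicodedata.name(ch)
        return True
    except ValueError:
        # Raised for any code point without a name, not only private use ones.
        return False

# Illustrative samples: one private use character and three ordinary ones.
samples = {
    "private use": "\uf026",
    "tab": "\t",
    "space": " ",
    "letter": "a",
}

results = {label: has_name(ch) for label, ch in samples.items()}
print(results)
```

Both "private use" and "tab" come back as unnamed, so the exception alone cannot serve as a PUA test.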
I have to test if '⸣' is not in myvariable. The type of myvariable is already <type 'unicode'>, whereas the character '⸣' (Unicode code point U+2E23) is outside the range of ASCII characters. Moreover, the scripts already use the pragma # -*- coding: utf-8 -*-.
In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: u'abcdefghijk' . Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.
To include Unicode characters in your Python source code, you can use Unicode escape sequences of the form \u0123 in your string. In Python 2.x, you also need to prefix the string literal with 'u'.
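As a small illustration of the escape sequences described above, here is a Python 3 sketch. The variable names are made up for the example, and U+2E23 is simply the code point mentioned earlier.

```python
# \u takes exactly four hex digits, \U takes exactly eight.
four_digit = "\u2e23"
eight_digit = "\U00002e23"

# Both escapes produce the same one-character string.
same = four_digit == eight_digit == chr(0x2E23)

# Code points above U+FFFF need the eight-digit form.
plane15 = "\U000F0000"
plane15_codepoint = ord(plane15)

# In Python 3 the u prefix is accepted but optional.
prefixed_equal = u"\u2e23" == "\u2e23"

# A membership test on an escaped character works like any other substring test.
missing = "\u2e23" not in "plain ascii text"

print(same, hex(plane15_codepoint), prefixed_equal, missing)
```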
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. Unicode (https://www.unicode.org/) is a specification that aims to list every character used by human languages and give each character its own unique code.
Private use characters are all in the Co general category, as returned by category() in unicodedata:
>>> import unicodedata
>>> def is_pua(c):
...     return unicodedata.category(c) == 'Co'
...
>>> is_pua(u'\uF026')
True
Given that the Unicode Standard guarantees that the set of private use characters will never change (no characters will ever be added or removed), it's also safe to hard-code the three ranges:
U+E000 to U+F8FF
U+F0000 to U+FFFFD
U+100000 to U+10FFFD
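A minimal sketch of the hard-coded variant follows, using the three ranges listed above. The names PUA_RANGES, is_pua_by_range and is_pua_by_category are hypothetical, chosen for this example; the category-based function is included only to cross-check the ranges.

```python
import unicodedata

# The three private use ranges, as inclusive (first, last) code point pairs.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)

def is_pua_by_range(ch):
    """Return True if the single character ch falls in a private use range."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in PUA_RANGES)

def is_pua_by_category(ch):
    """Return True if unicodedata reports the Co general category for ch."""
    return unicodedata.category(ch) == "Co"

# Spot-check the boundaries of each range against the category lookup.
boundary_points = [0xDFFF, 0xE000, 0xF8FF, 0xF900,
                   0xEFFFF, 0xF0000, 0xFFFFD, 0xFFFFE,
                   0x100000, 0x10FFFD, 0x10FFFE]
agree = all(is_pua_by_range(chr(cp)) == is_pua_by_category(chr(cp))
            for cp in boundary_points)
print(agree)
```

The range check avoids the unicodedata lookup entirely, while the category check stays readable and needs no constants, so either can be used depending on taste.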