 

Detecting Unicode private use area characters with Python

What is the proper way to identify Unicode private use characters in Python 3? There's nothing obviously relevant in the unicodedata module, which makes it easy to look up character names and attributes.

Some background: unicodedata.name(), which gives the name of unicode characters, will raise a ValueError if called with a private use character (e.g., try unicodedata.name("\uf026")). But whitespace characters (except for space itself), and possibly other things, also trigger an exception. So what's a non-hacky, reliable way to detect PUA characters?
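To make the problem concrete, here is a minimal sketch of why the name() lookup is not a reliable test. The helper has_name is a hypothetical name introduced only for this illustration; it reports whether unicodedata.name() succeeds for a character.

```python
import unicodedata

def has_name(c):
    """Return True if unicodedata.name() knows a name for c (hypothetical helper)."""
    try:
        unicodedata.name(c)
        return True
    except ValueError:
        # Raised for private use characters, but also for others
        # such as newline and tab.
        return False

for ch in ("\uf026", "\n", "\t", " ", "A"):
    print(repr(ch), has_name(ch))
```

Since the private use character, the newline and the tab all fail the lookup, a missing name alone cannot distinguish PUA characters from the rest.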

asked Sep 12 '15 at 14:09 by alexis




1 Answer

Private use characters are all in the Co general category, as returned by category() in unicodedata:

>>> import unicodedata
>>> def is_pua(c):
...     # 'Co' is the general category "Other, private use"
...     return unicodedata.category(c) == 'Co'
...
>>> is_pua('\uF026')
True

Given that the Unicode Standard guarantees that the set of private use characters will never change (no characters will ever be added or removed), it's also safe to hard-code the three ranges:

  • U+E000 to U+F8FF
  • U+F0000 to U+FFFFD
  • U+100000 to U+10FFFD
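
As a rough sketch of the hard-coded approach, the three ranges above can be checked directly against the code point. The names PUA_RANGES and is_pua_by_range are made up for this example and are not part of any library.

```python
# Inclusive code point bounds of the three private use ranges listed above.
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)

def is_pua_by_range(c):
    """Return True if the single character c falls in a private use range."""
    cp = ord(c)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

# Hypothetical usage: collect the private use characters in a string.
text = "abc\uf026def\U000F0001"
print([hex(ord(ch)) for ch in text if is_pua_by_range(ch)])
```

This version does not depend on the Unicode database bundled with the interpreter, at the cost of spelling out the ranges by hand.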
answered Sep 26 '22 at 17:09 by 一二三