I'm dealing with Unicode characters, and I wonder why some of them do not have a name in unicodedata. Below is sample code where you can see the <unknown> entries.
I thought that every character in the Unicode database was named. By the way, the unnamed ones are all in the same category, [Cc] Other, Control.
Another question: how can I get the Unicode code point value? Does ord(unicodechar) do the trick?
I also put the file here (as encoding is a weird thing), because I think my cut-and-paste with 'invisible' characters may be lossy.
#!/bin/env python
# -*- coding: utf-8 -*-
#extracted and licensing from here:
"""
:author: Laurent Pointal <[email protected]> <[email protected]>
:organization: CNRS - LIMSI
:copyright: CNRS - 2004-2009
:license: GNU-GPL Version 3 or greater
:version: $Id$
"""
# Chars alonemarks:
# !?¿;,*¤@°:%|¦/()[]{}<>«»´`¨&~=#±£¥$©®"
# must have spaces around them to make them tokens.
# Notes: they may be in pchar or fchar too, to identify punctuation after
# a fchar.
# \202 is a special ,
# \226 \227 are special -
alonemarks = u"!?¿;,\202*¤@°:%|¦/()[\]{}<>«»´`¨&~=#±\226"+\
u"\227£¥$©®\""
import unicodedata
for x in alonemarks:
    # '<unknown>' is the default returned when the character has no name.
    unicodename = unicodedata.name(x, '<unknown>')
    print "\t".join(map(unicode, (x, len(x), ord(x), unicodename, unicodedata.category(x))))
# unichr(int('fd9b', 16)).encode('utf-8')
# http://stackoverflow.com/questions/867866/convert-unicode-codepoint-to-utf8-hex-in-python
I thought that every character in the Unicode database was named
No, control characters don't have names; see the UnicodeData file.
Another question: how can I get the unicode code point value? Is it ord(unicodechar) that does the trick?
yes!
print '%x' % ord(unicodedata.lookup('LATIN LETTER SMALL CAPITAL Z'))
## 1d22
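To make the round trip between a character and its code point concrete, here is a minimal sketch. It assumes Python 3, where chr() replaces Python 2's unichr() and every str is Unicode; the helper name code_point_label is made up for illustration and is not part of unicodedata.

```python
import unicodedata


def code_point_label(ch):
    """Format one character as U+XXXX (hypothetical helper)."""
    return 'U+%04X' % ord(ch)


z = unicodedata.lookup('LATIN LETTER SMALL CAPITAL Z')
print(code_point_label(z))   # U+1D22
print(chr(ord(z)) == z)      # True: chr() is the inverse of ord()
print(ord('\x96'))           # 150, one of the unnamed characters from the question
```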
As per the unicodedata library documentation:
The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see here)
Your two characters display the following output:
1 150 <unknown> Cc
1 151 <unknown> Cc
They correspond to the control characters 0x96 and 0x97. The Unicode documentation above stipulates, in its code point paragraph, that:
Surrogate code points, private-use characters, control codes, noncharacters, and unassigned code points have no names.
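A quick way to see this rule in action is sketched below, assuming Python 3. The describe function is just an illustrative name; it only uses unicodedata.name and unicodedata.category, the same calls as in the question.

```python
import unicodedata


def describe(ch):
    """Return (code point, category, name or '<unknown>') for one character."""
    return (ord(ch), unicodedata.category(ch), unicodedata.name(ch, '<unknown>'))


print(describe('\x96'))      # a control code: category Cc, no name
print(describe('\ue000'))    # a private-use character: category Co, no name
print(describe('\u00bf'))    # a named character, for comparison

# Without a default value, unicodedata.name() raises ValueError
# for characters that have no name.
try:
    unicodedata.name('\x96')
    raised = False
except ValueError:
    raised = True
print(raised)
```

Passing a default as the second argument, as the question's code already does, is what turns the ValueError into the '<unknown>' entries.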
I don't know how to get the label comment corresponding to these characters through the unicodedata module, but I think you don't get any name for your two control characters because that is how the Unicode standard defines them.
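If you still want something readable for those characters, one possible workaround is a small fallback table of your own. The sketch below assumes Python 3; CONTROL_LABELS and name_or_label are hypothetical names, and the labels are hand-copied conventional names for three C1 control codes rather than anything returned by unicodedata.name.

```python
import unicodedata

# Hand-written labels for the three control characters in the question.
# Assumption: these are the conventional ISO 6429 names of the C1 controls;
# the table is illustrative and deliberately incomplete.
CONTROL_LABELS = {
    0x82: 'BREAK PERMITTED HERE',
    0x96: 'START OF GUARDED AREA',
    0x97: 'END OF GUARDED AREA',
}


def name_or_label(ch):
    """Return the Unicode name, else a hand-written label, else '<unknown>'."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        return CONTROL_LABELS.get(ord(ch), '<unknown>')


for ch in '\x82\x96\x97':
    print('%04X\t%s' % (ord(ch), name_or_label(ch)))
```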