Python, unicodedata name, and code point value: what am I missing?

Question

I'm dealing with Unicode characters, and I wonder why some of them do not have a name in unicodedata. Here is some sample code that prints <unknown> for the characters in question.

I thought that every character in the Unicode database was named. By the way, the unnamed ones all belong to the same category: [Cc] Other, Control.

Another question: how can I get the unicode code point value? Is it ord(unicodechar) that does the trick?

I also put the file here, since encoding is tricky and I suspect that copying and pasting 'invisible' characters may be lossy.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Extracted from the following source, under the license stated there:
"""
:author: Laurent Pointal <laurent.pointal@limsi.fr> <laurent.pointal@laposte.net>
:organization: CNRS - LIMSI
:copyright: CNRS - 2004-2009
:license: GNU-GPL Version 3 or greater
:version: $Id$
"""

# Chars alonemarks:
#         !?¿;,*¤@°:%|¦/()[]{}<>«»´`¨&~=#±£¥$©®"
# must have spaces around them to make them tokens.
# Notes: they may be in pchar or fchar too, to identify punctuation after
#        a fchar.
#        \202 is a special ,
#        \226 \227 are special -
alonemarks = u"!?¿;,\202*¤@°:%|¦/()[\]{}<>«»´`¨&~=#±\226"+\
     u"\227£¥$©®\""
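import unicodedata

# Print each character with its length, code point, name and general category.
for x in alonemarks:
    # name() raises ValueError for unnamed characters unless a default is given.
    unicodename = unicodedata.name(x, '<unknown>')
    print u"\t".join(map(unicode, (x, len(x), ord(x), unicodename, unicodedata.category(x))))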

    # unichr(int('fd9b', 16)).encode('utf-8')
    # http://stackoverflow.com/questions/867866/convert-unicode-codepoint-to-utf8-hex-in-python

georg · Accepted Answer

I thought that every character in the Unicode database was named

No, control characters don't have names; see the UnicodeData file.
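To make that concrete, here is a minimal sketch in Python 3 syntax (the question's code is Python 2). The helper name describe is only for illustration and is not part of unicodedata. It shows that unicodedata.name() needs a default for a control character, while an ordinary character has a name.

```python
import unicodedata

def describe(ch):
    """Return (code point, name or placeholder, general category) for one character."""
    return (ord(ch), unicodedata.name(ch, "<unknown>"), unicodedata.category(ch))

# U+0096 is a C1 control code: it has a category (Cc) but no name.
print(describe("\x96"))   # (150, '<unknown>', 'Cc')

# An ordinary punctuation character does have a name.
print(describe("-"))      # (45, 'HYPHEN-MINUS', 'Pd')

# Without the default argument, name() raises ValueError for unnamed characters.
try:
    unicodedata.name("\x96")
except ValueError as exc:
    print("no name:", exc)
```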

Another question: how can I get the unicode code point value? Is it ord(unicodechar) that does the trick?

Yes!

print '%x' % ord(unicodedata.lookup('LATIN LETTER SMALL CAPITAL Z'))
## 1d22
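For readers on Python 3, the same round trip might look like the sketch below. It assumes Python 3, where chr() replaces unichr() and print is a function; the variable names are illustrative.

```python
import unicodedata

# Look a character up by name, then recover its code point with ord().
ch = unicodedata.lookup("LATIN LETTER SMALL CAPITAL Z")
cp = ord(ch)

print("%x" % cp)        # 1d22
print("U+%04X" % cp)    # U+1D22

# chr() is the inverse of ord(): code point back to a one-character string.
assert chr(cp) == ch

# The UTF-8 encoding is a separate thing from the code point value.
utf8 = ch.encode("utf-8")
print(utf8)             # b'\xe1\xb4\xa2'
```

Note that ord() gives the code point itself, whereas encode('utf-8') gives the bytes used to store it.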

Zeugma · Answer

As per unicodedata library documentation,

The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see here)

Your two characters display the following output:

1   150 <unknown>   Cc
1   151 <unknown>   Cc

They correspond to the control characters at code points 0x96 and 0x97. The Unicode documentation above states in its code point paragraph that:

Surrogate code points, private-use characters, control codes, noncharacters, and unassigned code points have no names.
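If a printable placeholder is needed for such code points, one option is to build a label from the general category. The sketch below assumes Python 3; the function label is a hypothetical helper, not something unicodedata provides, and the <kind-XXXX> form follows the code point label convention used in the Unicode Standard. The "unnamed" fallback is my own addition for anything the module simply has no name for.

```python
import unicodedata

def label(ch):
    """Return the Unicode name, or a code point label such as <control-0096>."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        pass  # no name: fall back to a label derived from the category

    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cat == "Cc":
        kind = "control"
    elif cat == "Cs":
        kind = "surrogate"
    elif cat == "Co":
        kind = "private-use"
    elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
        kind = "noncharacter"
    elif cat == "Cn":
        kind = "reserved"   # unassigned code point
    else:
        kind = "unnamed"    # hypothetical catch-all, not a Unicode label type
    return "<%s-%04X>" % (kind, cp)

for ch in ("A", "\x96", "\x97", "\ue000", "\ufdd0"):
    print("U+%04X" % ord(ch), label(ch))
```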

I don't know how to get, through the unicodedata module, the descriptive comment that the Unicode data file attaches to these code points, but I think you don't get a name for your two control characters because the Unicode standard defines it that way.
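One possible explanation for why these code points show up at all: in a Python string literal, \202, \226 and \227 are octal escapes, so they yield U+0082, U+0096 and U+0097. The comments in the question call them "a special ," and "special -", which suggests (this is an assumption on my part, not something the question states) that they were meant as Windows-1252 bytes. A small Python 3 sketch of the difference:

```python
import unicodedata

# "\226" is an octal escape: 0o226 == 0x96 == 150, which is a C1 control code.
assert ord("\226") == 0o226 == 0x96 == 150
assert unicodedata.category("\226") == "Cc"

# Decoding the same byte values as Windows-1252 gives named punctuation instead.
decoded = {}
for byte in (b"\x82", b"\x96", b"\x97"):
    ch = byte.decode("cp1252")
    decoded[byte] = unicodedata.name(ch)
    print("0x%02X -> U+%04X %s" % (byte[0], ord(ch), decoded[byte]))
```

Under that assumption, writing the intended characters directly (for example u"\u2013" for the en dash) would give entries that do have names.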

Tags:

python

unicode

user1340802