
Python, unicodedata name, and code point value: what am I missing?

Tags: python, unicode

I'm working with Unicode characters, and I wonder why some of them have no name in unicodedata. Below is sample code; the characters in question are the ones printed as <unknown>.

I thought that every character in the Unicode database was named. Incidentally, the unnamed ones all belong to the same category: [Cc] Other, Control.

Another question: how can I get the Unicode code point value? Does ord(unicodechar) do the trick?

I have also put the file here, since encodings are tricky and my copy and paste of 'invisible' characters may be lossy.

#!/bin/env python
# -*- coding: utf-8 -*-

#extracted and licensing from here:
"""
:author: Laurent Pointal <[email protected]> <[email protected]>
:organization: CNRS - LIMSI
:copyright: CNRS - 2004-2009
:license: GNU-GPL Version 3 or greater
:version: $Id$
"""

# Chars alonemarks:
#         !?¿;,*¤@°:%|¦/()[]{}<>«»´`¨&~=#±£¥$©®"
# must have spaces around them to make them tokens.
# Notes: they may be in pchar or fchar too, to identify punctuation after
#        a fchar.
#        \202 is a special ,
#        \226 \227 are special -
alonemarks = u"!?¿;,\202*¤@°:%|¦/()[\]{}<>«»´`¨&~=#±\226"+\
     u"\227£¥$©®\""
import unicodedata
for x in alonemarks:
    unicodename = unicodedata.name(x, '<unknown>')
    print "\t".join(map(unicode, (x, len(x), ord(x), unicodename, unicodedata.category(x))))

    # unichr(int('fd9b', 16)).encode('utf-8')
    # http://stackoverflow.com/questions/867866/convert-unicode-codepoint-to-utf8-hex-in-python    
user1340802 asked Apr 27 '12 09:04


2 Answers

I thought that every character in the Unicode database was named

No, control characters don't have names; see the UnicodeData file.
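This can be checked against the string from the question. The following is a minimal sketch, assuming Python 3 (the question's code is Python 2), with the octal escapes \202, \226 and \227 rewritten as \x82, \x96 and \x97; the variable name `unnamed` is illustrative.

```python
import unicodedata

# Same characters as the question's alonemarks string, written for Python 3.
# \x82, \x96 and \x97 are the octal escapes \202, \226 and \227.
alonemarks = "!?¿;,\x82*¤@°:%|¦/()[\\]{}<>«»´`¨&~=#±\x96\x97£¥$©®\""

# Collect the characters for which the database has no name.
unnamed = [c for c in alonemarks if unicodedata.name(c, None) is None]

for c in unnamed:
    # ord() gives the code point; category() reports the general category.
    print("U+%04X" % ord(c), unicodedata.category(c))
```

Only the three escaped characters end up in `unnamed`, and each of them is reported with category Cc.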

Another question: how can I get the Unicode code point value? Does ord(unicodechar) do the trick?

Yes!

print '%x' % ord(unicodedata.lookup('LATIN LETTER SMALL CAPITAL Z'))
## 1d22
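For reference, a hedged Python 3 version of the same round trip. In Python 3, `print` is a function and `chr` takes the place of Python 2's `unichr`; the variable names are illustrative.

```python
import unicodedata

# Look up a character by name, then read its code point with ord().
ch = unicodedata.lookup('LATIN LETTER SMALL CAPITAL Z')
code_point = ord(ch)

print('%x' % code_point)        # plain hexadecimal form
print('U+%04X' % code_point)    # conventional U+XXXX notation

# chr() is the inverse of ord(); in Python 2 the equivalent is unichr().
same_ch = chr(code_point)

# The UTF-8 encoding of the character is a separate thing from its code point.
utf8_bytes = same_ch.encode('utf-8')
print(utf8_bytes)
```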
georg answered Sep 24 '22 03:09


As per the unicodedata library documentation:

The module uses the same names and symbols as defined by the UnicodeData File Format 5.2.0 (see here)

Two of your characters produce the following output:

1   150 <unknown>   Cc
1   151 <unknown>   Cc
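The values 150 and 151 are 0x96 and 0x97 in hexadecimal. The comments in the question's code describe \226 and \227 as "special -" and \202 as "special ,", which suggests that they were meant as Windows-1252 bytes rather than Unicode code points. That is an assumption about the original author's intent, not something the source confirms. A minimal Python 3 sketch of the difference:

```python
import unicodedata

raw = b"\x82\x96\x97"  # the octal escapes \202, \226, \227 as bytes

# Assumption: the bytes were meant as Windows-1252 text.
as_cp1252 = raw.decode("cp1252")

# Latin-1 maps each byte to the code point with the same value,
# which is what the escapes produce inside a unicode string literal.
as_code_points = raw.decode("latin-1")

for c in as_cp1252 + as_code_points:
    print("U+%04X" % ord(c), unicodedata.name(c, "<unknown>"))
```

Decoded as Windows-1252, the three bytes become a low quotation mark and two dashes, all of which have names; read as code points, they stay unnamed control characters.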

These are the control characters at code points 0x96 and 0x97. The Unicode documentation cited above states in its code point paragraph that:

Surrogate code points, private-use characters, control codes, noncharacters, and unassigned code points have no names.

I don't know how to get the comment label for these characters through the unicodedata module, but I think you get no name for your two control characters because the Unicode standard defines it that way.
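The Unicode standard does define code point labels such as <control-0096> for code points without names, but unicodedata does not appear to offer a function that returns them. A minimal sketch, assuming Python 3, that builds such a label from the general category; `name_or_label` and `_LABELS` are hypothetical names, not part of unicodedata:

```python
import unicodedata

# Label prefixes for categories whose code points have no name.
_LABELS = {"Cc": "control", "Cs": "surrogate", "Co": "private-use"}

def name_or_label(ch):
    """Return the Unicode name of ch, or a <type-NNNN> label if it has none."""
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    cp = ord(ch)
    category = unicodedata.category(ch)
    if category in _LABELS:
        kind = _LABELS[category]
    elif category != "Cn":
        # Not one of the unnamed kinds; this Python just lacks the name.
        return "<unknown>"
    elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        # U+FDD0..U+FDEF and the last two code points of every plane.
        kind = "noncharacter"
    else:
        kind = "reserved"  # unassigned code point
    return "<%s-%04X>" % (kind, cp)

print(name_or_label("\x96"))
print(name_or_label("A"))
```

Applied to the two characters from the question, this would print a label in place of <unknown> while leaving named characters unchanged.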

Zeugma answered Sep 26 '22 03:09