
namedtuple with unicode string as name

I'm having trouble assigning unicode strings as names for a namedtuple. This works:

import collections
a = collections.namedtuple("test", "value")

and this doesn't:

b = collections.namedtuple("βαδιζόντων", "value")

I get the error

Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib64/python3.4/collections/__init__.py", line 370, in namedtuple
        result = namespace[typename]
KeyError: 'βαδιζόντων'

Why is that the case? The documentation says, "Python 3 also supports using Unicode characters in identifiers," and the name is valid Unicode.

Thomas, asked May 28 '15 10:05


2 Answers

The problem is specifically with the letter ό (U+1F79 Greek small letter omicron with oxia). This is a ‘compatibility character’: Unicode would rather you use ό (U+03CC Greek small letter omicron with tonos) instead. U+1F79 exists in Unicode only to round-trip to old character sets that distinguished between oxia and tonos, a distinction that later turned out to be incorrect.

When you use compatibility characters in an identifier, Python's source code parser automatically normalises them to form NFKC, so your class name ends up with U+03CC in it.
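The normalisation described above can be reproduced with the standard unicodedata module. This is only a small sketch to make the two code points visible; the variable names are illustrative.

```python
import unicodedata

oxia = "\u1f79"   # the character used in the question
tonos = "\u03cc"  # the character Unicode prefers

# Look up the official names of both code points.
print(unicodedata.name(oxia))
print(unicodedata.name(tonos))

# Apply the same normalisation form the parser applies to identifiers.
normalized = unicodedata.normalize("NFKC", oxia)
print(normalized == tonos)
```

The two characters render identically in most fonts, so comparing code points like this is usually the quickest way to tell which one a string contains.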

Unfortunately, collections.namedtuple doesn't know about this. It creates the new class by inserting the given name into a template of Python source code held in a string, executing it (yuck, right?), and then extracting the class from the resulting namespace dict by name. The lookup uses the original name, not the normalised version that Python actually compiled, so it fails with the KeyError shown in the question.
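The mismatch can be shown without namedtuple at all. The sketch below is not the library's code; it only imitates the "exec a class definition, then look the class up by name" pattern, with a one-line class body as a stand-in for the real template.

```python
import unicodedata

typename = "βαδιζ\u1f79ντων"  # contains U+1F79, as in the question

# Build and execute a class definition from a string, as the template approach does.
namespace = {}
exec("class {}: pass".format(typename), namespace)

normalized = unicodedata.normalize("NFKC", typename)

# Look the class up under the original name, then under the normalised name.
print(typename in namespace)
print(normalized in namespace)
```

The first lookup is expected to miss and the second to succeed, because the parser stored the class under the NFKC form of the name.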

This is a bug in collections which may be worth filing, but for now you should use the canonical character U+03CC ό.

bobince, answered Oct 06 '22 00:10


That ό is U+1F79 GREEK SMALL LETTER OMICRON WITH OXIA. Python identifiers are normalized to NFKC, and the NFKC form of U+1F79 is U+03CC GREEK SMALL LETTER OMICRON WITH TONOS.

Interestingly, if you use the same string with U+1F79 replaced by U+03CC, it works.

>>> b = collections.namedtuple("βαδιζ\u03CCντων", "value")
>>>

The documentation for namedtuple claims that "Any valid Python identifier may be used for a fieldname". Both strings, the one with U+1F79 and the one with U+03CC, are valid Python identifiers, as can easily be tested in the interpreter.

>>> βαδιζόντων = 0
>>> βαδιζόντων = 0
>>>
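Since the two spellings are visually indistinguishable, the same check can be written with escapes so that the code points are explicit. This is a sketch using str.isidentifier; the variable names are made up for the example.

```python
with_oxia = "βαδιζ\u1f79ντων"   # spelled with U+1F79
with_tonos = "βαδιζ\u03ccντων"  # spelled with U+03CC

# The strings differ as data, even though they look the same on screen.
print(with_oxia == with_tonos)

# Both are accepted as identifiers.
print(with_oxia.isidentifier(), with_tonos.isidentifier())
```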

This is definitely a bug in the implementation. I traced it to this bit in the implementation of namedtuple:

namespace = dict(__name__='namedtuple_%s' % typename)
exec(class_definition, namespace)
result = namespace[typename] # here!

I guess that the typename left in the namespace dictionary by exec'ing the class_definition template, being a Python identifier, will be in NFKC form, and thus no longer match the actual value of the typename variable used to retrieve it. I believe simply pre-normalizing typename should fix this, but I haven't tested it.
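Until the library handles this itself, the pre-normalising idea can be applied on the caller's side. The wrapper below is a hypothetical helper, not part of collections, and it normalises only the type name; field names are passed through unchanged.

```python
import collections
import unicodedata


def nfkc_namedtuple(typename, field_names):
    """Hypothetical helper: normalise typename to NFKC, then call namedtuple."""
    normalized = unicodedata.normalize("NFKC", typename)
    return collections.namedtuple(normalized, field_names)


# The type name below contains U+1F79; the helper rewrites it to U+03CC first.
Word = nfkc_namedtuple("βαδιζ\u1f79ντων", "value")
print(Word(value=1))
```

With the name normalised up front, the string handed to namedtuple already matches what the parser would produce, so a lookup by name cannot diverge from it.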

R. Martinho Fernandes, answered Oct 05 '22 22:10