Python allows Unicode identifiers. I defined Xᵘ = 42, expecting Xu and Xᵤ to result in a NameError. But in reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as somewhat of an unpythonic thing to do. Why is this happening?
>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)
Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
The NFKC form of both the superscript and subscript characters is the lowercase u:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'
So in the end, all you have is a single identifier, Xu:
>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
  1           0 LOAD_CONST               0 (42)
              2 STORE_NAME               0 (Xu)

  2           4 LOAD_NAME                1 (print)
              6 LOAD_NAME                0 (Xu)
              8 LOAD_NAME                0 (Xu)
             10 LOAD_NAME                0 (Xu)
             12 BUILD_TUPLE              3
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 LOAD_CONST               1 (None)
             20 RETURN_VALUE
The above disassembly of the compiled bytecode shows that the identifiers have already been normalised by the time the code is compiled. This happens during parsing: every identifier is normalised when the AST (Abstract Syntax Tree) is created, and the compiler then uses that tree to produce bytecode.
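To see the normalisation at the AST level rather than in the bytecode, here is a minimal sketch using the standard ast module. The variable names (source, tree, names) are only illustrative, and it assumes a CPython 3 interpreter:

```python
import ast

# Same two lines as in the question: one assignment, one print call.
source = "Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))"
tree = ast.parse(source)

# Collect every identifier that the parser stored in the tree.
names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]

# Only two distinct names should remain: the normalised Xu, and print.
print(sorted(set(names)))
```

If the parser normalises as described, the superscript and subscript spellings never reach the AST at all, so nothing later in the pipeline can tell them apart.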
Identifiers are normalized to avoid many potential 'look-alike' bugs, where you could otherwise end up using both ﬁnd() (using the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII nd characters) and find() and wonder why your code has a bug.
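The ligature case can be checked directly. The following is a small sketch, not a definitive test: it defines a function whose name is written with U+FB01 and then inspects the namespace it was defined in. The names ligature_name, normalized, and namespace are made up for this illustration.

```python
import unicodedata

# "find" spelled with U+FB01 LATIN SMALL LIGATURE FI followed by ASCII "nd".
ligature_name = "\ufb01nd"

# As plain strings the two spellings differ; NFKC maps the ligature to "fi".
normalized = unicodedata.normalize("NFKC", ligature_name)
print(ligature_name == "find", normalized == "find")

# Define a function using the ligature spelling in a fresh namespace.
namespace = {}
exec("def \ufb01nd():\n    return 'found'\n", namespace)

# The parser stored the normalised name, so only the ASCII key exists.
print("find" in namespace, ligature_name in namespace)
```

One consequence worth keeping in mind: normalisation applies to identifiers in source code, not to strings, so a lookup such as namespace[ligature_name] or getattr(obj, ligature_name) with the un-normalised string will not find the name.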