
Some utf8 chars allowed in python source, some not

I've noticed that I cannot use all Unicode characters in my Python source code.

While

def 价(何):

is perfectly all right (albeit nonsensical [probably?]),

def N(N₀, t, λ) -> 'N(t)':

this isn't allowed (the subscript zero that is).

I also can't use some other characters, most of which I recognise as something other than letters (mathematical operators, for example). I always thought that if I just stuck to the rules I know, i.e. composing names from letters and numbers, with a letter as the first character, all would be okay. Now, the subscript zero is clearly a 'number', so my impression was wrong.

I know I should avoid using special characters. However, the function definition above (the exponential decay one that is) seems to me perfectly reasonable - because it will never change, and it so elegantly conveys all the information needed for another programmer to use it.

My question, therefore: exactly which characters are allowed and which aren't? And where?

Edit
All right, I seem not to have been clear enough. I am using Python 3, so there is no need to declare the encoding of the source file. I thought that was apparent from the fact that my Chinese function definition works.

My question concerns why some characters are allowed there, while others aren't. The subscript zero raises an error, "invalid character in identifier", but the blackboard bold zero works. Both are equally special, I'd say.

I'd like to know whether there are general rules that apply beyond my situation; there must be. My error does not seem to be an accident.

Edit 2:

The answer comes courtesy of Beau Martínez, who pointed me to the language reference, where I should have looked in the first place:

http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html

It appears that the characters that are allowed are all explicitly chosen.

Stefano Palazzo asked Aug 13 '10 07:08


2 Answers

As per the language reference, Python 3 allows a large variety of characters in identifiers.

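That zero subscript character seems like a number, but it isn't a digit for Python. Its Unicode general category is No (Number, other), whereas the characters allowed after the first character of an identifier come from categories such as Nd (decimal digit), which covers 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 as well as other decimal digits such as the blackboard bold zero. That is why the subscript zero is rejected while a Greek character such as Phi, which is a letter, is accepted.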
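To see the difference for yourself, a small sketch like the following may help. It only uses the standard `unicodedata` module and `str.isidentifier()`; the `samples` dictionary and the `describe` helper are names made up for this illustration, and the characters are the ones from the question plus an ASCII zero for comparison.

```python
import unicodedata

# Characters from the question, plus an ASCII zero for comparison.
samples = {
    "ASCII zero": "0",
    "subscript zero": "\N{SUBSCRIPT ZERO}",
    "double-struck zero": "\N{MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO}",
    "lambda": "λ",
    "CJK character": "价",
}


def describe(ch):
    """Return (general category, usable after the first identifier character)."""
    # Prefixing a known-good letter tests the "continue" position only.
    return unicodedata.category(ch), ("N" + ch).isidentifier()


report = {label: describe(ch) for label, ch in samples.items()}
for label, (category, ok) in report.items():
    print(f"{label:20} category={category} allowed_after_first={ok}")
```

The subscript zero should come out as category No and be rejected, while the double-struck zero is category Nd and is accepted.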

Importantly, how easily can you type those characters with your keyboard? I don't want to pull up the character map every time I have to call your functions, for example. Calling it "maximum_decay_rate" or something much more intuitive to any user, not just a Physics major, makes your code more readable.

If you say it isn't allowed, it may be because you haven't specified the character encoding of your source file. You can declare it by putting # -*- coding: utf-8 -*- (or whichever encoding you use) at the beginning of the file. Python 3 assumes UTF-8 when no declaration is present.

Humphrey Bogart answered Oct 03 '22 00:10


Tell Python what the proper encoding is:

https://www.python.org/dev/peps/pep-0263/

Either...

# -*- coding: utf-8 -*-

or

# coding=utf-8
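If you want to check what Python makes of such a declaration, here is a rough sketch using `tokenize.detect_encoding` from the standard library. The `declared_encoding` helper and the sample byte strings are made up for this illustration; the last two cases are only there for comparison.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to decode this source."""
    encoding, _lines_read = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding


# The two declaration styles shown above.
emacs_style = declared_encoding(b"# -*- coding: utf-8 -*-\nx = 1\n")
plain_style = declared_encoding(b"# coding=utf-8\nx = 1\n")

# For comparison: a different encoding, and no declaration at all.
latin_style = declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
no_declaration = declared_encoding(b"x = 1\n")

print(emacs_style, plain_style, latin_style, no_declaration)
```

Both declaration styles are recognised, and a file without any declaration falls back to UTF-8 in Python 3.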

As for which characters are actually allowed in variable names, the restriction is typically alphabetic characters, digits, and underscores.

The "subscript zero" is not actually a digit; it's a subscript.
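To find out which character in a name is the culprit, something along these lines could work. This is only a sketch: `first_invalid_char` is a hypothetical helper, not anything built into Python, and it relies on `str.isidentifier()` to apply the real rules rather than re-implementing them.

```python
import unicodedata


def first_invalid_char(name):
    """Return (index, char, Unicode name, category) of the first character
    that is not allowed at its position in an identifier, or None."""
    for index, ch in enumerate(name):
        # The first character is tested on its own; later characters are
        # tested behind a known-good letter, so only the rules for
        # non-initial characters apply to them.
        candidate = ch if index == 0 else "a" + ch
        if not candidate.isidentifier():
            return index, ch, unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)
    return None


for name in ["N\N{SUBSCRIPT ZERO}", "N0", "0N", "价", "decay_rate"]:
    print(repr(name), first_invalid_char(name))
```

For the name from the question it should point at position 1, the SUBSCRIPT ZERO character, while an ordinary name like `N0` passes.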

Amber answered Oct 03 '22 01:10