How can I get Python isidentifier() functionality in Python 2.6?

Question

Python 3 has a string method called str.isidentifier.

How can I get similar functionality in Python 2.6, short of rewriting my own regex, etc.?

John La Rooy · Accepted Answer

The tokenize module defines a regexp called Name:

import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)

ch0wner · Answer

Invalid Identifier Validation

All of the answers in this thread seem to repeat the same validation mistake, which allows strings that are not valid identifiers to match as if they were.

The regex patterns suggested in the other answers are built from tokenize.Name, which holds the regex pattern [a-zA-Z_]\w* (on Python 2.7.15), followed by the '$' regex anchor.

Please refer to the official Python 3 description of identifiers and keywords, which contains a paragraph that is relevant to Python 2 as well:

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

Thus 'foo\n' should not be considered a valid identifier.

While one may argue that this code is functional:

>>> class Foo():
...     pass
...
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar

Although the newline character is a valid ASCII character, it is not considered a letter. Furthermore, there is clearly no practical use for an identifier that ends with a newline character:

>>> f.foo\n
SyntaxError: unexpected character after line continuation character

The str.isidentifier function also confirms this is an invalid identifier:

Python 3 interpreter:

>>> print('foo\n'.isidentifier())
False

The `$` anchor vs the `\Z` anchor

Quoting the official Python 2 regular expression syntax documentation:

$

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
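The behaviour described in the quoted documentation can be checked directly. The following is a minimal sketch using only the standard re module; the variable names are made up for illustration, and the last line adds \Z for comparison.

```python
import re

text = 'foo1\nfoo2\n'

# Without MULTILINE, $ matches only at the very end of the string
# or just before a trailing newline, so the second line is found.
default_match = re.search(r'foo.$', text).group()

# With MULTILINE, $ also matches before every newline,
# so the first line is found.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group()

# A lone $ matches twice in 'foo\n': before the newline and at the end.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]

# \Z matches only at the very end of the string.
z_positions = [m.start() for m in re.finditer(r'\Z', 'foo\n')]

print(default_match)     # foo2
print(multiline_match)   # foo1
print(dollar_positions)  # [3, 4]
print(z_positions)       # [4]
```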

As a result, with the $ anchor a string that ends with a newline still matches as a valid identifier:

>>> import tokenize
>>> import re
>>> m = re.match(tokenize.Name + '$', 'foo\n')
>>> m
<_sre.SRE_Match at 0x3eac8e0>
>>> print m.group()
foo

The regex pattern should use the \Z anchor instead of the $ anchor. Quoting once again:

\Z

Matches only at the end of the string.

With \Z, the pattern correctly rejects the string:

>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True
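Putting the pieces together, a helper along these lines might serve as a stand-in for str.isidentifier on Python 2.6. This is only a sketch under stated assumptions: the name is_identifier is made up for illustration, the character class is spelled out explicitly (rather than taken from tokenize.Name, whose value can differ between Python versions), and only ASCII identifiers are accepted, following the Python 2 rules quoted earlier. Like the accepted answer, and unlike str.isidentifier itself, it also rejects keywords.

```python
import keyword
import re

# Assumption: ASCII-only identifiers, as in Python 2.
# \Z (not $) is used so that a trailing newline is rejected.
_IDENTIFIER_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')


def is_identifier(candidate):
    """Return True if candidate is a valid, non-keyword ASCII identifier."""
    return (_IDENTIFIER_RE.match(candidate) is not None
            and not keyword.iskeyword(candidate))


print(is_identifier('foo'))    # True
print(is_identifier('foo\n'))  # False
print(is_identifier('1foo'))   # False
print(is_identifier('class'))  # False
```

Drop the keyword.iskeyword check if the goal is to mirror str.isidentifier exactly, since that method returns True for keywords such as 'class'.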

Dangerous Implications

See Luke's answer for another example of how this kind of weak regex matching could, in other circumstances, have more dangerous implications.

Tags:

python

python-3.x

identifier

python-2.6

Douglas S. J. De Couto

4 Answers
