
How can I get Python isidentifier() functionality in Python 2.6?

Python 3 has a string method called str.isidentifier

How can I get similar functionality in Python 2.6, short of rewriting my own regex, etc.?

Douglas S. J. De Couto asked Mar 30 '10 12:03


4 Answers

The tokenize module defines a regexp called Name:

import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)
John La Rooy answered Oct 31 '22 10:10


Invalid Identifier Validation


All of the answers in this thread seem to repeat the same validation mistake: they accept some strings that are not valid identifiers.

The regex patterns suggested in the other answers are built from tokenize.Name, which holds the regex pattern [a-zA-Z_]\w* (on Python 2.7.15), followed by the '$' regex anchor.

Please refer to the official Python 3 description of identifiers and keywords, which contains a paragraph that is relevant to Python 2 as well:

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

Thus 'foo\n' should not be considered a valid identifier.

While one may argue that this code is functional:

>>> class Foo():
...     pass
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar

Although the newline character is a valid ASCII character, it is not considered to be a letter. Furthermore, there is clearly no practical use for an identifier that ends with a newline character:

>>> f.foo\n
SyntaxError: unexpected character after line continuation character

The str.isidentifier function also confirms this is an invalid identifier:

Python 3 interpreter:

>>> print('foo\n'.isidentifier())
False

The $ anchor vs the \Z anchor


Quoting the official Python 2 Regular Expression syntax:

$

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

As a result, a string that ends with a newline still matches as a valid identifier:

>>> import tokenize
>>> import re
>>> m = re.match(tokenize.Name + '$', 'foo\n')
>>> m
<_sre.SRE_Match at 0x3eac8e0>
>>> m.group()
'foo'
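
The behaviour of $ described in the quoted documentation can be checked directly. This is only a small sketch using the same sample strings as the docs; the variable names are mine, not part of any API.

```python
import re

text = 'foo1\nfoo2\n'

# Without MULTILINE, `$` matches only at the end of the string
# or just before its final newline.
default_match = re.search(r'foo.$', text).group()

# With MULTILINE, `$` also matches before every newline.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group()

# A bare `$` finds two empty matches in a string that ends with a newline:
# one just before the newline and one at the very end.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]

print(default_match)
print(multiline_match)
print(dollar_positions)
```

The empty match just before the trailing newline is what lets a pattern ending in $ accept 'foo\n'.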

The regex pattern should use the \Z anchor instead of the $ anchor. Quoting once again:

\Z

Matches only at the end of the string.

With \Z, the pattern correctly rejects the string:

>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True
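
Putting the pieces together, a helper for Python 2 might look like the sketch below. It assumes ASCII-only identifiers and spells the pattern out instead of using tokenize.Name, because that constant is not the same in every Python version. The name is_identifier and the allow_keywords parameter are my own choices for illustration.

```python
import keyword
import re

# ASCII-only identifier pattern. `\Z` anchors at the true end of the string,
# so a trailing newline is rejected (unlike `$`).
_IDENTIFIER_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')


def is_identifier(candidate, allow_keywords=False):
    """Return True if candidate is an ASCII identifier.

    Keywords such as 'def' are rejected unless allow_keywords is True.
    Note that Python 3's str.isidentifier() does accept keywords.
    """
    if _IDENTIFIER_RE.match(candidate) is None:
        return False
    return allow_keywords or not keyword.iskeyword(candidate)


for sample in ['foo', '_bar1', 'foo\n', '1abc', 'def', '']:
    print(repr(sample), is_identifier(sample))
```

Whether keywords should count is a design choice: pass allow_keywords=True to mirror str.isidentifier(), or keep the default if the string is meant to be used as a variable or attribute name.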

Dangerous Implications


See Luke's answer for another example of how this kind of weak regex matching could, in other circumstances, have more dangerous implications.

Further Reading


Python 3 added support for non-ASCII identifiers; see PEP 3131.
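
To illustrate the difference, here is a short Python 3 only sketch comparing str.isidentifier with an ASCII-only pattern. The pattern and the sample name are assumptions chosen for the example.

```python
import re

# ASCII-only pattern matching the Python 2 identifier rules.
ASCII_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')

# "caf" followed by U+00E9 (e with acute accent), written with an escape
# so the source itself stays ASCII.
name = u'caf\u00e9'

# Python 3 only: str.isidentifier() follows the PEP 3131 rules.
accepted_by_python3 = name.isidentifier()

# The ASCII pattern rejects the same string.
rejected_by_ascii_regex = ASCII_IDENTIFIER.match(name) is None

print(accepted_by_python3, rejected_by_ascii_regex)
```

So an ASCII regex is a reasonable stand-in for Python 2 identifiers, but it is not a drop-in replacement for str.isidentifier on Python 3.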

ch0wner answered Oct 31 '22 11:10


re.match(r'[a-z_]\w*$', s, re.I)

should do nicely. As far as I know there isn't any built-in method.

SilentGhost answered Oct 31 '22 12:10


Good answers so far. I'd write it like this.

import keyword
import re

def isidentifier(candidate):
    "Is the candidate string an identifier in Python 2.x"
    is_not_keyword = candidate not in keyword.kwlist
    pattern = re.compile(r'^[a-z_][a-z0-9_]*$', re.I)
    matches_pattern = bool(pattern.match(candidate))
    return is_not_keyword and matches_pattern
Jason R. Coombs answered Oct 31 '22 10:10