Python 3 has a string method called str.isidentifier.
How can I get similar functionality in Python 2.6, short of rewriting my own regex, etc.?
The tokenize module defines a regexp called Name:
import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)
All of the answers in this thread seem to repeat the same validation mistake, which lets strings that are not valid identifiers match as if they were.
The regex patterns suggested in the other answers are built from tokenize.Name, which holds the regex pattern [a-zA-Z_]\w* (running Python 2.7.15), and the '$' regex anchor.
Please refer to the official Python 3 description of identifiers and keywords (which contains a paragraph that is relevant to Python 2 as well):
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
Thus 'foo\n' should not be considered a valid identifier.
While one may argue that this code is functional:
>>> class Foo():
...     pass
...
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar
Although the newline character is a valid ASCII character, it is not considered a letter. Furthermore, there is clearly no practical use for an identifier that ends with a newline character, since it cannot be written in source code:
>>> f.foo\n
SyntaxError: unexpected character after line continuation character
The str.isidentifier function also confirms that this is an invalid identifier (Python 3 interpreter):
>>> print('foo\n'.isidentifier())
False
The $ anchor vs the \Z anchor
Quoting the official Python 2 regular expression syntax:
$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
As a result, a string that ends with a newline is matched as a valid identifier:
>>> import tokenize
>>> import re
>>> m = re.match(tokenize.Name + '$', 'foo\n')
>>> m
<_sre.SRE_Match at 0x3eac8e0>
>>> print m.group()
foo
The regex pattern should not use the $ anchor; \Z is the anchor that should be used instead.
Quoting once again:
\Z
Matches only at the end of the string.
With \Z, the string that ends with a newline is correctly rejected:
>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True
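Putting the pieces together, here is a minimal sketch of a helper that combines the \Z anchor with the keyword check. The name is_py2_identifier is my own, not anything from the standard library, and I spell the pattern out explicitly instead of using tokenize.Name, because tokenize.Name does not hold the same pattern in every Python version. It assumes ASCII-only identifiers, as in Python 2.

```python
import keyword
import re

# Explicit ASCII pattern (assumption: Python 2 style identifiers only).
# \Z anchors at the true end of the string, so a trailing newline is rejected.
_IDENTIFIER_RE = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')


def is_py2_identifier(candidate):
    """Return True if candidate is an ASCII identifier and not a keyword.

    Hypothetical helper name, written for illustration only.
    """
    return (_IDENTIFIER_RE.match(candidate) is not None
            and not keyword.iskeyword(candidate))
```

With this, is_py2_identifier('foo') is accepted, while 'foo\n', '1foo', the empty string and keywords such as 'class' are rejected. Note that keyword.iskeyword reflects the keyword list of the interpreter that runs it, so the result for a word like print depends on the Python version.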
See Luke's answer for another example of how this kind of weak regex matching could, in other circumstances, have more dangerous implications.
Python 3 added support for non-ASCII identifiers; see PEP 3131.
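To illustrate the difference, here is a small Python 3 only sketch (the sample strings are my own) showing what str.isidentifier accepts. Keep in mind that it does not check for keywords, so that test is still needed separately.

```python
import keyword

# Python 3 only: str.isidentifier follows PEP 3131, so non-ASCII letters pass.
samples = ['foo', 'caf\u00e9', 'foo\n', '1foo', 'class']
results = {s: s.isidentifier() for s in samples}

# 'class' passes isidentifier(), so keywords have to be filtered separately.
usable = [s for s in samples if s.isidentifier() and not keyword.iskeyword(s)]
```

Here 'foo' and 'caf\u00e9' end up in usable, while 'foo\n' and '1foo' fail isidentifier() and 'class' is dropped by the keyword check.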
re.match(r'[a-z_]\w*$', s, re.I)
should do nicely. As far as I know there isn't any built-in method.
Good answers so far. I'd write it like this.
import keyword
import re

def isidentifier(candidate):
    "Is the candidate string an identifier in Python 2.x"
    is_not_keyword = candidate not in keyword.kwlist
    pattern = re.compile(r'^[a-z_][a-z0-9_]*$', re.I)
    matches_pattern = bool(pattern.match(candidate))
    return is_not_keyword and matches_pattern