Regular expression to confirm whether a string is a valid Python identifier?

Tags:

python

regex

for-loop

identifier

I have the following definition for an Identifier:

Identifier --> letter{ letter| digit}

Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.

I've tried this:

if re.match('\w+(\w\d)?', i):     
  return True
else:
  return False

but when I run my program every time it meets an integer it thinks that it's a valid identifier.

For example

c = 0 ;

it prints c as a valid identifier, which is fine, but it also prints 0 as a valid identifier.

What am I doing wrong here?

asked Mar 29 '11 14:03

user682194

2 Answers

This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:

No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.

The reasons are:

As @JoeCondron pointed out, Python reserved keywords such as True, if, and return are not valid identifiers. A regex alone cannot handle this, so additional filtering is required.
Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers that the lexer accepts in an identifier are not the same categories matched by \d, \w, and \W in the re module, as demonstrated by @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.

While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the second by limiting ourselves to ASCII-only identifiers, why bother using a regex at all?

As Hatshepsut said:

str.isidentifier() works

Just use it, problem solved.
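As a minimal sketch of that advice, str.isidentifier() can be combined with keyword.iskeyword() so that reserved words are rejected too. The function name is_valid_identifier is only an illustrative choice, not something from the standard library:

```python
import keyword


def is_valid_identifier(name: str) -> bool:
    """Return True if `name` can be used as a Python 3 identifier."""
    # str.isidentifier() applies the lexer's rules but still accepts keywords,
    # so reserved words are filtered out separately.
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["c", "0", "_a1", "1a", "if", "True"]:
    print(f"{candidate!r}: {is_valid_identifier(candidate)}")
```

With this check, c and _a1 are accepted, while 0, 1a, if and True are rejected.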

As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:

identifier ::=  (letter|"_") (letter | digit | "_")*

Which can be expressed by the regular expression:

^[^\d\W]\w*\Z

Example:

import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = [ "a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = identifier.match(test)
    print("%r\t= %s" % (test, (result is not None)))

Result:

'a'      = True
'a1'     = True
'_a1'    = True
'1a'     = False
'aa$%@%' = False
'aa bb'  = False
'aa_bb'  = True
'aa\n'   = False
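For the narrower grammar given in the question (a letter followed by letters or digits, with no underscore), a simpler pattern may be enough. The sketch below assumes that "letter" and "digit" mean ASCII only; the names SIMPLE_IDENTIFIER and is_simple_identifier are made up for illustration. It also shows why the pattern in the question accepted 0:

```python
import re

# Hypothetical pattern for the question's grammar: letter { letter | digit }
# Assumption: ASCII letters and digits only, and no underscore.
SIMPLE_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_simple_identifier(token: str) -> bool:
    # fullmatch() requires the whole token to match, so no anchors are needed.
    return SIMPLE_IDENTIFIER.fullmatch(token) is not None


# The question's pattern accepts "0" because \w also matches digits,
# and re.match() only anchors at the start of the string.
print(bool(re.match(r"\w+(\w\d)?", "0")))

for token in ["c", "0", "a1", "1a", "_a"]:
    print(token, is_simple_identifier(token))
```

Only c and a1 pass this stricter check.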

answered Sep 21 '22 15:09

MestreLion

str.isidentifier() works. The regex answers incorrectly fail to match some valid Python identifiers and incorrectly match some invalid ones.

str.isidentifier() Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords.

Use keyword.iskeyword() to test for reserved identifiers such as def and class.

@martineau's comment gives the example of '℘᧚' where the regex solutions fail.

>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False

Why does this happen?

Let's define the set of code points that match the given regular expression, and the set of code points that str.isidentifier accepts.

import re
import unicodedata

# range(0x110000) covers every code point up to and including U+10FFFF
chars = {chr(i) for i in range(0x110000) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x110000) if chr(i).isidentifier()}

How many regex matches are not identifiers?

In [26]: len(chars - identifiers)
Out[26]: 698

How many identifiers are not regex matches?

In [27]: len(identifiers - chars)
Out[27]: 4

Interesting -- which ones?

In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}
Out[37]:
set([
    ('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
    ('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
    ('℘', 'SCRIPT CAPITAL P', 'Sm'),
    ('℮', 'ESTIMATED SYMBOL', 'So'),
])

What's different about these two sets?

They have different Unicode "General Category" values.

In [31]: {unicodedata.category(c) for c in chars - identifiers}
Out[31]: set(['Lm', 'Lo', 'No'])

According to Wikipedia, those are Letter, modifier; Letter, other; and Number, other. This is consistent with the re docs, since \d matches only decimal digits:

\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])

What about the other way?

In [32]: {unicodedata.category(c) for c in identifiers - chars}
Out[32]: set(['Mn', 'Sm', 'So'])

That's Mark, nonspacing; Symbol, math; Symbol, other.
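To make the mismatch concrete, here is a small spot check on one character from each side. The helper name compare is hypothetical, and the two sample characters were picked only as illustrations of the "No" and "Sm" categories mentioned above:

```python
import re
import unicodedata

# Same pattern as the regex answer; fullmatch() makes the anchors unnecessary.
pattern = re.compile(r"[^\d\W]\w*")


def compare(char: str):
    """Return (general category, regex result, str.isidentifier() result)."""
    return (
        unicodedata.category(char),
        pattern.fullmatch(char) is not None,
        char.isidentifier(),
    )


print(compare("\u00b2"))  # SUPERSCRIPT TWO: matched by the regex, not an identifier
print(compare("\u2118"))  # SCRIPT CAPITAL P: an identifier, not matched by the regex
```

SUPERSCRIPT TWO counts as a word character for \w but is not a decimal digit, so the regex lets it through, while SCRIPT CAPITAL P is a symbol that the language accepts as an identifier start.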

Where is this all documented?

In the Python Language Reference
In PEP 3131 - Supporting Non-ASCII Identifiers

Where is it implemented?

https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255

I still want a regular expression

Look at the regex module on PyPI.

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

It includes filters for "General Category".

answered Sep 17 '22 15:09

Hatshepsut
