I have the following definition for an Identifier:
Identifier --> letter{ letter| digit}
Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.
I've tried this:
if re.match('\w+(\w\d)?', i):
    return True
else:
    return False
but when I run my program every time it meets an integer it thinks that it's a valid identifier.
For example, given
c = 0 ;
it prints c as a valid identifier, which is fine, but it also prints 0 as a valid identifier.
What am I doing wrong here?
Python String isidentifier() Method
The isidentifier() method returns True if the string is a valid identifier, otherwise False. A string is considered a valid identifier if it contains only letters (a-z, A-Z), digits (0-9), or underscores (_). A valid identifier cannot start with a digit or contain any spaces.
re.fullmatch() checks whether the whole string matches the regular expression pattern. If it does, it returns a match object; otherwise it returns None.
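As a minimal sketch of how the grammar from the question (letter{letter|digit}) could be checked with a whole-string match: the names IDENTIFIER and is_identifier are hypothetical, and treating "letter" and "digit" as ASCII-only is an assumption about the grammar, not something the question states.

```python
import re

# letter { letter | digit }: one letter, then any run of letters or digits.
# Assumption: "letter" means A-Z/a-z and "digit" means 0-9 (ASCII only).
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_identifier(token):
    """Return True if the whole token matches letter{letter|digit}."""
    # fullmatch returns a match object or None, so compare against None.
    return IDENTIFIER.fullmatch(token) is not None


for token in ["c", "0", "x1", "1x", "a_b"]:
    print(token, is_identifier(token))
```

The original pattern accepts 0 because \w also matches digits, so \w+ succeeds on a lone digit. Requiring a letter in the first position and matching the whole string avoids that.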
In Java, a valid identifier consists of letters [A-Z] or [a-z], digits [0-9], underscores (_), or dollar signs ($). For example, @javatpoint is not a valid identifier because it contains the special character @. An identifier cannot contain spaces; for example, java tpoint is an invalid identifier.
This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:
No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.
The reasons are:
As @JoeCondron pointed out, Python reserved keywords such as True, if, and return are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.
Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers that the lexical parser accepts in a valid identifier do not coincide with the categories matched by \d, \w, and \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.
While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the other by limiting ourselves to ASCII-only identifiers, why bother using a regex at all?
As Hatshepsut said:
str.isidentifier() works
Just use it, problem solved.
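A minimal sketch of what that looks like in practice, combining str.isidentifier() with keyword.iskeyword() to reject reserved words. The helper name is_usable_name is hypothetical, chosen only for illustration.

```python
import keyword


def is_usable_name(s):
    """True if s is a valid identifier and not a reserved keyword."""
    # str.isidentifier() alone accepts keywords such as "if" and "True",
    # so they are filtered out separately.
    return s.isidentifier() and not keyword.iskeyword(s)


for s in ["a1", "_a1", "1a", "aa bb", "if", "True"]:
    print(repr(s), is_usable_name(s))
```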
As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:
identifier ::= (letter|"_") (letter | digit | "_")*
Which can be expressed by the regular expression:
^[^\d\W]\w*\Z
Example:
import re
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)
tests = [ "a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n" ]
for test in tests:
    result = re.match(identifier, test)
    print("%r\t= %s" % (test, (result is not None)))
Result:
'a' = True
'a1' = True
'_a1' = True
'1a' = False
'aa$%@%' = False
'aa bb' = False
'aa_bb' = True
'aa\n' = False
str.isidentifier() works. The regex answers incorrectly fail to match some valid Python identifiers and incorrectly match some invalid ones.
str.isidentifier()
Return true if the string is a valid identifier according to the language definition, section Identifiers and keywords. Use keyword.iskeyword() to test for reserved identifiers such as def and class.
@martineau's comment gives the example of '℘᧚', where the regex solutions fail.
>>> '℘᧚'.isidentifier()
True
>>> import re
>>> bool(re.search(r'^[^\d\W]\w*\Z', '℘᧚'))
False
Let's define the set of code points that match the given regular expression, and the set of code points that satisfy str.isidentifier.
import re
import unicodedata
# range(0x110000) covers every code point up to and including U+10FFFF
chars = {chr(i) for i in range(0x110000) if re.fullmatch(r'^[^\d\W]\w*\Z', chr(i))}
identifiers = {chr(i) for i in range(0x110000) if chr(i).isidentifier()}
How many regex matches are not identifiers?
In [26]: len(chars - identifiers)
Out[26]: 698
How many identifiers are not regex matches?
In [27]: len(identifiers - chars)
Out[27]: 4
Interesting -- which ones?
In [37]: {(c, unicodedata.name(c), unicodedata.category(c)) for c in identifiers - chars}
Out[37]:
set([
    ('\u1885', 'MONGOLIAN LETTER ALI GALI BALUDA', 'Mn'),
    ('\u1886', 'MONGOLIAN LETTER ALI GALI THREE BALUDA', 'Mn'),
    ('℘', 'SCRIPT CAPITAL P', 'Sm'),
    ('℮', 'ESTIMATED SYMBOL', 'So'),
])
They have different Unicode "General Category" values.
In [31]: {unicodedata.category(c) for c in chars - identifiers}
Out[31]: set(['Lm', 'Lo', 'No'])
From Wikipedia, that's Letter, modifier; Letter, other; Number, other. This is consistent with the re docs, since \d is only decimal digits:
\d Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd])
What about the other way?
In [32]: {unicodedata.category(c) for c in identifiers - chars}
Out[32]: set(['Mn', 'Sm', 'So'])
That's Mark, nonspacing; Symbol, math; Symbol, other.
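To inspect a single character the same way, a small sketch like the following may help. The helper name describe is hypothetical; it reports the character's General Category alongside what str.isidentifier() and \w say about it.

```python
import re
import unicodedata


def describe(ch):
    """Return (general category, isidentifier result, matches \\w) for one character."""
    return (
        unicodedata.category(ch),
        ch.isidentifier(),
        re.fullmatch(r"\w", ch) is not None,
    )


for ch in ["a", "_", "℘", "℮"]:
    print(repr(ch), describe(ch))
```

For '℘' this reports category Sm with isidentifier() returning True while \w does not match, which is the disagreement described above.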
https://github.com/python/cpython/commit/47383403a0a11259acb640406a8efc38981d2255
Look at the regex module on PyPI.
This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.
It includes filters for "General Category".