I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters, so the old way of doing things is no longer enough:
t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*"
In my markup language parser I match Unicode characters by allowing every character except the ones I explicitly use, because that markup language has only two or three characters I need to escape that way.
How do I match all Unicode characters with Python regexes and PLY? Also, is this a good idea at all?
I'd like to let people use names like Ω » « ° foo² väli π as identifiers (variable names and such) in their programs. Heck, I'd like people to be able to write programs in their own language if it's practical! Anyway, Unicode is supported in a wide variety of places nowadays, and it should spread.
Edit: POSIX character classes don't seem to be recognised by Python regexes.
>>> import re
>>> item = re.compile(r'[[:word:]]')
>>> print item.match('e')
None
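A small sketch of why the session above prints None, assuming Python 3: the re module has no POSIX bracket classes, so the pattern is read as an ordinary character set followed by a literal `]`. The names `posix_style` and `word_char` are only illustrative.

```python
import re
import warnings

# Python's re module has no POSIX bracket classes. The pattern below is read as
# the set "[[:word:]" (the characters [ : w o r d) followed by a literal "]".
with warnings.catch_warnings():
    # Newer Python versions may warn about the "[" inside the set.
    warnings.simplefilter("ignore", FutureWarning)
    posix_style = re.compile(r'[[:word:]]')

print(posix_style.match('e'))    # None: 'e' is not one of [ : w o r d
print(posix_style.match('w]'))   # a match object: 'w' from the set, then ']'

# \w is the closest built-in counterpart of [[:word:]].
word_char = re.compile(r'\w')
print(word_char.match('e'))      # a match object
```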
Edit: To explain better what I need: I'd need a regex that matches all the printable Unicode characters but no ASCII characters at all.
Edit: r"\w" does some of what I want, but it does not match « », and I also need a regex that does not match numbers.
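One possible reading of "printable Unicode but not ASCII", sketched for Python 3: a negated range excludes the ASCII block, and `str.isprintable()` filters out non-printable code points. The helper name `printable_non_ascii` is hypothetical, and whether symbols such as « should be legal in identifiers is a separate design question.

```python
import re

# Any single character outside the ASCII range U+0000..U+007F.
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def printable_non_ascii(text):
    """Return the non-ASCII characters of text that str.isprintable() accepts."""
    return [ch for ch in NON_ASCII.findall(text) if ch.isprintable()]

print(printable_non_ascii('a Ω « b ° 9'))  # ['Ω', '«', '°']
```

Note that this also accepts punctuation and symbols, unlike \w, so it is closer to the "allow everything except what I use" approach from the markup parser.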
The re module supports the \w syntax, which the documentation describes as follows:
If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.
Therefore the following example shows how to match Unicode identifiers:
>>> import re
>>> m = re.compile(r'(?u)[^\W0-9]\w*')
>>> m.match('a')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('9')
>>> m.match('ab')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('a9')
<_sre.SRE_Match object at 0xb7d75410>
>>> m.match('unicöde')
<_sre.SRE_Match object at 0xb7c258e0>
>>> m.match('ödipus')
<_sre.SRE_Match object at 0xb7d75410>
So the expression you are looking for is: (?u)[^\W0-9]\w*
You need to pass the reflags parameter to lex.lex:
lex.lex(reflags=re.UNICODE)
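As a rough check of the identifier expression outside PLY, here is a sketch assuming Python 3, where str patterns are Unicode-aware by default. `IDENT` and `STRICT_IDENT` are illustrative names, and the `\d` variant is an assumption of mine rather than part of the answer: `[^\W0-9]` only excludes ASCII digits from the first position, while `[^\W\d]` excludes every Unicode decimal digit.

```python
import re

# The expression from the answer; in Python 3, str patterns are Unicode-aware
# by default, so the (?u) flag is not needed.
IDENT = re.compile(r'[^\W0-9]\w*')

# Variant (an assumption, not from the answer): \d also rejects non-ASCII
# decimal digits as the first character.
STRICT_IDENT = re.compile(r'[^\W\d]\w*')

# '\u0663' is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
for candidate in ['väli', 'π', '_tmp', '9lives', '«', '\u0663abc']:
    print(repr(candidate),
          bool(IDENT.fullmatch(candidate)),
          bool(STRICT_IDENT.fullmatch(candidate)))
```

Neither pattern accepts « or », since \w only covers alphanumeric characters and the underscore.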