
Python regex - why does end of string ($ and \Z) not work with group expressions?

Tags: python, regex

In Python 2.6, it seems that the end-of-string markers $ and \Z are not compatible with group expressions. For example,

import re
re.findall(r"\w+[\s$]", "green pears")

returns

['green ']

(so $ effectively does not work). And using

re.findall(r"\w+[\s\Z]", "green pears")

results in an error:

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in findall(pattern, string, flags)
    175 
    176     Empty matches are included in the result."""
--> 177     return _compile(pattern, flags).findall(string)
    178 
    179 if sys.hexversion >= 0x02020000:

/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.pyc in _compile(*key)
    243         p = sre_compile.compile(pattern, flags)
    244     except error, v:
--> 245         raise error, v # invalid expression
    246     if len(_cache) >= _MAXCACHE:
    247         _cache.clear()

error: internal: unsupported set operator

Why does it work that way, and how can I work around it?

Piotr Migdal asked Oct 06 '12 20:10


1 Answer

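A [...] expression is a character group (also called a character class or set), meaning it'll match any one character contained therein. Inside the brackets $ loses its special meaning, so you are matching a literal $ character. A character group always consumes exactly one input character, and thus can never contain an anchor such as $ or \Z; that is also why the pattern with \Z fails to compile.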
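To see the literal-$ behaviour in isolation, here is a minimal sketch. The sample strings are made up for illustration and are not from the question.

```python
import re

# Inside a character group, $ is an ordinary character, not an anchor.
literal_hits = re.findall(r"[\s$]", "a$b c")
print(literal_hits)  # ['$', ' ']

# At the end of the string there is no character left to consume,
# so the group cannot match there and the last word is never found.
no_hits = re.findall(r"\w+[\s$]", "pears")
print(no_hits)  # []
```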

If you want to match either a whitespace character or the end of the string, use a non-capturing group instead, combined with the | alternation operator:

r"\w+(?:\s|$)"
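Applied to the string from the question, the pattern above should pick up both words. The sketch below also tries \Z in place of $, on the assumption that you want a strict end-of-string match; the variable names are only illustrative.

```python
import re

text = "green pears"

# $ sits outside any character group here, so it acts as an anchor.
with_dollar = re.findall(r"\w+(?:\s|$)", text)
print(with_dollar)  # ['green ', 'pears']

# \Z works the same way once it is moved out of the brackets.
with_z = re.findall(r"\w+(?:\s|\Z)", text)
print(with_z)  # ['green ', 'pears']
```

Note that the whitespace character is part of the match when the \s branch is taken, which is why the first result keeps its trailing space.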

Alternatively, look at the \b word boundary anchor. It matches anywhere a run of \w characters starts or ends (so it anchors to points in the text where a \w character is preceded or followed by a \W character, or is at the start or end of the string).
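A short sketch of the \b approach, again using the string from the question. Because \b is zero-width, the matches contain only the words, without the trailing whitespace.

```python
import re

text = "green pears"

# \b asserts a word boundary without consuming any character.
words = re.findall(r"\w+\b", text)
print(words)  # ['green', 'pears']
```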

Martijn Pieters answered Sep 30 '22 16:09