Why is regex search in substring "not completely equivalent to slicing the string" in Python?

As the documentation states, using regex.search(string, pos, endpos) is not completely equivalent to slicing the string, i.e. regex.search(string[pos:endpos]). It doesn't match as if the string started at pos, so ^ does not match the beginning of the substring, only the real beginning of the whole string. However, $ matches at the end of the substring (endpos) as well as at the end of the whole string.

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []        # am is not at the beginning
    >>> re.compile('^am').findall('I am falling in code'[2:12])
    ['am']    # am is the beginning
    >>> re.compile('ing$').findall('I am falling in code', 2, 12)
    ['ing']   # ing is the ending
    >>> re.compile('ing$').findall('I am falling in code'[2:12])
    ['ing']   # ing is the ending

    >>> re.compile('(?<= )am').findall('I am falling in code', 2, 12)
    ['am']    # before am there is a space
    >>> re.compile('(?<= )am').findall('I am falling in code'[2:12])
    []        # before am there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code', 2, 12)
    []        # after ing there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code'[2:12])
    []        # after ing there is no space

    >>> re.compile(r'\bm.....').findall('I am falling in code', 3, 11)
    []
    >>> re.compile(r'\bm.....').findall('I am falling in code'[3:11])
    ['m fall']
    >>> re.compile(r'.....n\b').findall('I am falling in code', 3, 11)
    ['fallin']
    >>> re.compile(r'.....n\b').findall('I am falling in code'[3:11])
    ['fallin']

My questions are: why is the behavior not consistent between the beginning and the end of the match? Why does using pos and endpos treat endpos as the real end, while pos is not treated as the real beginning?

Is there any approach to make pos and endpos imitate slicing? Because Python copies the string when slicing instead of just referencing the old one, it would be more efficient to use pos and endpos instead of slicing when working with a big string multiple times.

asked Jun 23 '15 10:06 by fikr4n

1 Answer

The starting position argument pos is especially useful for writing lexical analysers, for example. The performance difference between slicing a string with [pos:] and using the pos parameter might seem insignificant, but it is not: every slice copies the rest of the input, so a scanner that slices once per token does work quadratic in the input length. See for example this bug report in the JsLex lexer.

Indeed, ^ matches at the real beginning of the string or, if MULTILINE is specified, also at the beginning of each line. This is by design, so that a scanner based on regular expressions can easily distinguish the real beginning of a line or of the input from some other point within a line or within the input.
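
A small sketch of that distinction, reusing the string from the question plus a hypothetical two-line variant (two_lines is an assumption introduced only for illustration). In both strings pos=2 points at "am", but ^ only matches there when the preceding character is a newline and MULTILINE is set:

```python
import re

text = 'I am falling in code'
two_lines = 'I\nam falling in code'   # hypothetical variant: newline before "am"

plain = re.compile('^am')
multi = re.compile('^am', re.MULTILINE)

# pos=2 is the index of "a" in both strings.
no_anchor = plain.findall(text, 2)            # ^ only matches at index 0 of the whole string
after_space = multi.findall(text, 2)          # character before pos is a space, not a newline
after_newline = multi.findall(two_lines, 2)   # character before pos is a newline, so ^ matches

print(no_anchor, after_space, after_newline)
```

The regex engine looks at the character before pos in the original string, which is something a slice could not offer.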

Do note that you can also use the regex.match(string[, pos[, endpos]]) function to anchor the match to the beginning of the string or to the position specified by pos; thus instead of doing

>>> re.compile('^am').findall('I am falling in code', 2, 12)
[]

you'd generally implement a scanner as

>>> match = re.compile('am').match('I am falling in code', 2, 12)
>>> match
<_sre.SRE_Match object; span=(2, 4), match='am'>

and then set the pos to match.end() (which in this case returns 4) for the successive matching operations.
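
Put together, such a scanner loop might look roughly like the following. This is only a minimal sketch: the TOKEN_PATTERNS table and the tokenize function are hypothetical names made up for this example, not part of the re module.

```python
import re

# Hypothetical token patterns, tried in order at the current position.
TOKEN_PATTERNS = [
    ('WORD', re.compile(r'[A-Za-z]+')),
    ('SPACE', re.compile(r'\s+')),
]

def tokenize(text, pos=0, endpos=None):
    """Yield (kind, lexeme) pairs without copying the remainder of `text`."""
    if endpos is None:
        endpos = len(text)
    while pos < endpos:
        for kind, pattern in TOKEN_PATTERNS:
            # match() is anchored at pos, so no ^ is needed in the patterns.
            match = pattern.match(text, pos, endpos)
            if match:
                yield kind, match.group()
                pos = match.end()   # the next token starts where this one ended
                break
        else:
            raise ValueError('unexpected character at index %d' % pos)

tokens = list(tokenize('I am falling in code', 2, 12))
print(tokens)
```

Only the matched lexemes are copied out of the input; the input string itself is never sliced.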

The match must be found starting exactly at the pos:

>>> re.compile('am').match('I am falling in code', 1, 12)
>>> 

(Notice how .match is anchored at the beginning of the input as if by an implicit ^, but not at the end of the input; this is often a source of errors, as people believe the match has both an implicit ^ and an implicit $. Python 3.4 added regex.fullmatch, which does this.)
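
To illustrate the difference, here is a short comparison of match and fullmatch over the same pos and endpos, assuming Python 3.4 or later (the variable names are only for this example):

```python
import re

text = 'I am falling in code'
pattern = re.compile('am')

prefix = pattern.match(text, 2, 12)      # "am" only has to start at pos
whole = pattern.fullmatch(text, 2, 12)   # "am" would have to cover everything from pos to endpos
exact = pattern.fullmatch(text, 2, 4)    # the span from 2 to 4 is exactly "am"

print(prefix is not None, whole is None, exact.span())
```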


As for why the endpos parameter is not consistent with the pos - that I do not know exactly, but it also makes some sense to me, as in Python 2 there is no fullmatch and there anchoring with $ is the only way to ensure that the entire span must be matched.