Why is regex search in substring "not completely equivalent to slicing the string" in Python?

Tags: python, string, substring, regex

As the documentation states, regex.search(string, pos, endpos) is not completely equivalent to slicing the string, i.e. regex.search(string[pos:endpos]). Matching does not behave as if the string started at pos, so ^ does not match at the beginning of the substring, only at the real beginning of the whole string. However, $ matches at the end of the substring as well as at the end of the whole string.

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []        # am is not at the beginning
    >>> re.compile('^am').findall('I am falling in code'[2:12])
    ['am']    # am is the beginning
    >>> re.compile('ing$').findall('I am falling in code', 2, 12)
    ['ing']   # ing is the ending
    >>> re.compile('ing$').findall('I am falling in code'[2:12])
    ['ing']   # ing is the ending

    >>> re.compile('(?<= )am').findall('I am falling in code', 2, 12)
    ['am']    # before am there is a space
    >>> re.compile('(?<= )am').findall('I am falling in code'[2:12])
    []        # before am there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code', 2, 12)
    []        # after ing there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code'[2:12])
    []        # after ing there is no space

    >>> re.compile(r'\bm.....').findall('I am falling in code', 3, 11)
    []
    >>> re.compile(r'\bm.....').findall('I am falling in code'[3:11])
    ['m fall']
    >>> re.compile(r'.....n\b').findall('I am falling in code', 3, 11)
    ['fallin']
    >>> re.compile(r'.....n\b').findall('I am falling in code'[3:11])
    ['fallin']

My questions are... Why is the behaviour not consistent between the beginning and the end of the match range? Why do pos and endpos treat the end as the real end, while the start is not treated as the real start?

Is there any way to make pos and endpos imitate slicing? Because Python copies the string when slicing instead of just referencing the old one, it would be more efficient to use pos and endpos instead of slicing when working with a big string multiple times.

asked Jun 23 '15 10:06

fikr4n

1 Answer

The starting position argument pos is especially useful for writing lexical analysers, for example. The performance difference between slicing a string with [pos:] and using the pos parameter might seem insignificant, but it certainly is not; see for example this bug report in the JsLex lexer.

Indeed, ^ matches at the real beginning of the string or, if MULTILINE is specified, also at the beginning of a line. This is by design, so that a scanner based on regular expressions can easily distinguish between the real beginning of a line or of the input and some other point on a line or within the input.
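
A minimal sketch of that distinction, using a made-up two-line string (the text and variable names below are illustrative and not part of the original question):

```python
import re

text = "first line\nsecond line"
start_of_line = re.compile(r"^\w+", re.MULTILINE)

# pos=0 is the real start of the string, so ^ matches there.
at_start = start_of_line.match(text, 0)

# pos=6 is in the middle of "first line"; ^ does not match there.
mid_line = start_of_line.match(text, 6)

# pos=11 is right after the newline, so with MULTILINE ^ matches again.
after_newline = start_of_line.match(text, 11)

# Without MULTILINE, ^ matches only at the real beginning of the string.
no_multiline = re.compile(r"^\w+").match(text, 11)

print(at_start.group(), mid_line, after_newline.group(), no_multiline)
```

A scanner can therefore use ^ to ask "am I at the start of a line?" regardless of where pos currently points.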

Do note that you can also use the regex.match(string[, pos[, endpos]]) method to anchor the match to the beginning of the string or to the position specified by pos; thus instead of doing

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []

you'd generally implement a scanner as

    >>> match = re.compile('am').match('I am falling in code', 2, 12)
    >>> match
    <_sre.SRE_Match object; span=(2, 4), match='am'>

and then set pos to match.end() (which in this case returns 4) for the successive matching operations.
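
As a rough sketch of such a scanner loop: the token table and the tokenize helper below are hypothetical names invented for illustration, not part of any library. The point is only that pos advances via match.end() and the input string is never sliced.

```python
import re

# Hypothetical token patterns, chosen only for this illustration.
TOKEN_PATTERNS = [
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("SPACE", re.compile(r"\s+")),
]

def tokenize(text, pos=0, endpos=None):
    """Yield (kind, lexeme) pairs, advancing pos with match.end() instead of slicing."""
    if endpos is None:
        endpos = len(text)
    while pos < endpos:
        for kind, pattern in TOKEN_PATTERNS:
            # match() anchors at pos, so no ^ is needed in the patterns.
            match = pattern.match(text, pos, endpos)
            if match:
                yield kind, match.group()
                pos = match.end()  # absolute index into the original string
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")

tokens = list(tokenize("I am falling in code", 2, 12))
print(tokens)
```

Because match.end() is an index into the original string, it can be passed straight back as the next pos.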

The match must be found starting exactly at pos:

    >>> re.compile('am').match('I am falling in code', 1, 12)
    >>>

(Notice how .match is anchored at the beginning of the input as if by an implicit ^, but not at the end of the input. This is often a source of errors, as people believe the match has both an implicit ^ and an implicit $; Python 3.4 added regex.fullmatch, which does this.)
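
A small sketch of the difference, reusing the string from the question (the variable names are illustrative; fullmatch requires Python 3.4 or later):

```python
import re

pattern = re.compile("am")
text = "I am falling in code"

# match() only anchors at pos; trailing text up to endpos is allowed.
prefix_only = pattern.match(text, 2, 12)

# fullmatch() must consume everything from pos to endpos.
whole_span_fails = pattern.fullmatch(text, 2, 12)  # the span is "am falling"
whole_span_ok = pattern.fullmatch(text, 2, 4)      # the span is exactly "am"

print(prefix_only.span(), whole_span_fails, whole_span_ok.span())
```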

As for why the endpos parameter is not consistent with pos: that I do not know exactly, but it also makes some sense to me, since Python 2 has no fullmatch, and there anchoring with $ is the only way to ensure that the entire span is matched.
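
One way to picture the asymmetry, as far as I understand the re documentation: endpos behaves as if the string were only endpos characters long, whereas pos merely moves the starting index. The sketch below compares passing (pos, endpos) with truncating only the end of the string; findall_both is a hypothetical helper written for this comparison.

```python
import re

text = "I am falling in code"

def findall_both(pattern, pos, endpos):
    """Return findall results for (pos, endpos) and for a string truncated at endpos."""
    rx = re.compile(pattern)
    return rx.findall(text, pos, endpos), rx.findall(text[:endpos], pos)

# In each pair, the second call slices off only the end and keeps pos as an index.
dollar = findall_both(r"ing$", 2, 12)
lookahead = findall_both(r"ing(?= )", 2, 12)
boundary = findall_both(r".....n\b", 3, 11)
caret = findall_both(r"^am", 2, 12)

print(dollar, lookahead, boundary, caret)
```

In all four cases from the question the two results agree, which is consistent with the end being treated as a real end while the start is not.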

answered Oct 20 '22 00:10

Antti Haapala -- Слава Україні
