How do I use re.search starting from a certain index in the string?

Tags: python, regex

Seems like a simple thing but I'm not seeing it. How do I start the search in the middle of a string?

btd asked Oct 17 '14 23:10


1 Answer

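The re.search function doesn't take a start argument like the str methods do. But the search method of a compiled pattern (the object returned by re.compile, documented as re.RegexObject in Python 2 and re.Pattern in current Python 3) does take a pos argument, and an optional endpos as well.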

This makes sense if you think about it. If you really need to use the same regular expressions over and over, you probably should be compiling them. Not so much for efficiency—the cache works nicely for most applications—but just for readability.
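
As a minimal sketch of the compiled-pattern approach (the sample string and pattern are made up for illustration, not taken from the question):

```python
import re

s = "abcdefabcdef"            # illustrative sample text
pattern = re.compile(r"abc")  # illustrative pattern

# The compiled pattern's search method accepts pos (and optionally endpos).
first = pattern.search(s)
second = pattern.search(s, 3)   # begin searching at index 3

print(first.start())    # 0
print(second.start())   # 6
```

Note that the offsets reported by the match are still relative to the whole string, not to pos.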


But what if you need to use the top-level function, because you can't pre-compile your patterns for some reason?

Well, there are plenty of third-party regular expression libraries. Some of these wrap PCRE or Google's RE2 or ICU, some implement regular expressions from scratch, and they all have at least slightly different, sometimes radically different, APIs.

But the regex module (available from PyPI), which is being designed to be an eventual replacement for re in the stdlib (although it's been bumped a couple of times now because it's not quite ready), is pretty much usable as a drop-in replacement for re, and (among other extensions) it takes pos and endpos arguments on its search function.


The most common reason you'd want to do this is to "find the next match after the one I just found", and there's a much easier way to do that: use finditer instead of search.

For example, this str-method loop:

i = 0
while True:
    i = s.find(sub, i)
    if i == -1:
        break
    do_stuff_with(s, i)
    # Step past this match; otherwise find() returns the same index forever.
    i += len(sub)

… translates to this much nicer regex loop:

for match in re.finditer(pattern, s):
    do_stuff_with(match)
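
A small runnable version of the finditer loop, assuming a made-up sample string and pattern, just to show what each match object carries:

```python
import re

s = "abcdefabcdef"   # illustrative sample text

# finditer yields one match object per non-overlapping match, left to right.
spans = [m.span() for m in re.finditer(r"abc", s)]
print(spans)   # [(0, 3), (6, 9)]
```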

When that isn't appropriate, you can always slice the string:

match = re.search(pattern, s[index:])

But that copies everything from index to the end of your string, which could be a problem if s is actually, say, a 12GB mmap. (Of course, for the 12GB mmap case, you'd probably want to map a new window… but there are cases where that won't help.)
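
Slicing also changes what the match reports. A short sketch, using an illustrative string and index that are not from the question:

```python
import re

s = "abcdefabcdef"   # illustrative sample text
index = 3

sliced = re.search(r"abc", s[index:])
# Offsets come back relative to the slice, so add index to map them onto s.
print(sliced.start())           # 3
print(sliced.start() + index)   # 6

# Anchors differ too: ^ matches at the start of the slice, but a compiled
# pattern searched with pos=index still treats index 0 as the real start.
print(re.search(r"^def", s[index:]) is not None)   # True
print(re.compile(r"^def").search(s, index))        # None
```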


Finally, you can always just modify your pattern to skip over index characters:

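match = re.search('.{%d}(?:%s)' % (index, pattern), s)

All I've done here is to add, e.g., .{20} to the start of the pattern, which means to match exactly 20 of any character, plus whatever else you were trying to match. (Wrapping the original pattern in (?:...) keeps a top-level | in it from splitting off the prefix. Also note that . doesn't match newlines unless you pass re.DOTALL.) Here's a simple example: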

.{3}(abc)

If I give this abcdefabcdef, it will match the first 'abc' after the 3rd character—that is, the second abc.

But notice that what it actually matches is 'defabc'. Because I'm using a capture group for my real pattern, and I'm not putting the .{3} in a group, match.group(1) and so on will work exactly as I'd want them to, but match.group(0) will give me the wrong thing. If that matters, you need lookbehind.
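
A quick sketch comparing the two forms on the same sample string (the variable names are only for illustration). Python's re requires lookbehind to be fixed-width, which .{N} satisfies:

```python
import re

s = "abcdefabcdef"
index = 3

# Prefix form: the skipped characters become part of the overall match.
skipped = re.search(r".{%d}(abc)" % index, s)
print(skipped.group(0))   # defabc
print(skipped.group(1))   # abc

# Lookbehind form: requires index characters before the match without
# consuming them, so group(0) and span() describe only the real match.
behind = re.search(r"(?<=.{%d})abc" % index, s)
print(behind.group(0))    # abc
print(behind.span())      # (6, 9)
```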

abarnert answered Oct 07 '22 01:10