Seems like a simple thing but I'm not seeing it. How do I start the search in the middle of a string?
The re.search function doesn't take a start argument like the str methods do. But the search method of a compiled pattern (re.compile / re.RegexObject) does take a pos argument.
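As a minimal sketch of that, assuming a made-up sample string and pattern (neither comes from the question), the compiled pattern's search method accepts the starting position as its second argument:

```python
import re

# Hypothetical sample data, chosen only for illustration.
s = "abcdefabcdef"
pattern = re.compile(r"abc")

# Without pos, the scan starts at index 0.
first = pattern.search(s)

# With pos=3, the scan starts at index 3, so the first "abc" is skipped.
later = pattern.search(s, 3)

print(first.start())  # 0
print(later.start())  # 6
```

The offsets on the returned match are relative to the whole string, so no adjustment is needed afterwards.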
This makes sense if you think about it. If you really need to use the same regular expressions over and over, you probably should be compiling them. Not so much for efficiency (the cache works nicely for most applications), but just for readability.
But what if you need to use the top-level function, because you can't pre-compile your patterns for some reason?
Well, there are plenty of third-party regular expression libraries. Some of these wrap PCRE or Google's RE2 or ICU, some implement regular expressions from scratch, and they all have at least slightly different, sometimes radically different, APIs.
But the regex module, which is being designed to be an eventual replacement for re in the stdlib (although it's been bumped a couple of times now because it's not quite ready), is pretty much usable as a drop-in replacement for re, and (among other extensions) it takes pos and endpos arguments on its search function.
The most common reason you'd want to do this is to "find the next match after the one I just found", and there's a much easier way to do that: use finditer instead of search.
For example, this str-method loop:
i = 0
while True:
    i = s.find(sub, i)
    if i == -1:
        break
    do_stuff_with(s, i)
    i += 1  # move past this hit, or find() returns the same index forever
… translates to this much nicer regex loop:
for match in re.finditer(pattern, s):
    do_stuff_with(match)
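One difference is worth seeing in a small sketch: a str.find loop that advances one character at a time reports overlapping hits, while finditer yields only non-overlapping matches. The helper name find_all and the sample strings here are hypothetical, used only for illustration:

```python
import re

def find_all(s, sub):
    """Collect every index where sub occurs, using a str.find loop."""
    found = []
    i = s.find(sub)
    while i != -1:
        found.append(i)
        i = s.find(sub, i + 1)  # advance by one, so overlapping hits count
    return found

s = "abcdefabcdef"
print(find_all(s, "abc"))                                # [0, 6]
print([m.start() for m in re.finditer(r"abc", s)])       # [0, 6]

# The two approaches diverge when occurrences overlap.
print(find_all("aaaa", "aa"))                            # [0, 1, 2]
print([m.start() for m in re.finditer(r"aa", "aaaa")])   # [0, 2]
```

For most "find the next match" loops the non-overlapping behaviour is what you want anyway.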
When that isn't appropriate, you can always slice the string:
match = re.search(pattern, s[index:])
But that makes an extra copy of half your string, which could be a problem if s is actually, say, a 12GB mmap. (Of course, for the 12GB mmap case, you'd probably want to map a new window… but there are cases where that won't help.)
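Copying is not the only thing slicing changes. Here is a rough sketch, with a made-up sample string and index, of two other differences from the compiled pattern's pos argument: match offsets become relative to the slice, and "^" treats the start of the slice as the start of the string.

```python
import re

# Hypothetical sample data, chosen only for illustration.
s = "abcdefabcdef"
index = 3

sliced = re.search(r"abc", s[index:])
with_pos = re.compile(r"abc").search(s, index)

# Offsets from the sliced search are relative to the slice, not to s.
print(sliced.start())          # 3
print(sliced.start() + index)  # 6
print(with_pos.start())        # 6

# "^" matches at the start of the slice, but pos does not move the
# real beginning of the string, so the anchored search at pos fails.
anchored_slice = re.search(r"^def", s[index:])
anchored_pos = re.compile(r"^def").search(s, index)
print(anchored_slice is not None)  # True
print(anchored_pos is not None)    # False
```

So if you slice, remember to add index back to any offsets you keep.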
Finally, you can always just modify your pattern to skip over index characters:
match = re.search('.{%d}%s' % (index, pattern), s)
All I've done here is to add, e.g., .{20} to the start of the pattern, which means to match exactly 20 of any character, plus whatever else you were trying to match. Here's a simple example:
.{3}(abc)
If I give this abcdefabcdef, it will match the first 'abc' after the 3rd character, that is, the second abc.
But notice that what it actually matches is 'defabc'. Because I'm using capture groups for my real pattern, and I'm not putting the .{3} in a group, match.group(1) and so on will work exactly as I'd want them to, but match.group(0) will give me the wrong thing. If that matters, you need lookbehind.
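A small sketch of the difference, using the same abcdefabcdef example; the lookbehind form shown is one way to write it, not the only one:

```python
import re

s = "abcdefabcdef"
index = 3

# Prefix version: the skipped characters become part of the overall match.
prefixed = re.search(r".{%d}(abc)" % index, s)
print(prefixed.group(0))  # defabc
print(prefixed.group(1))  # abc

# Lookbehind version: require index characters before the match without
# consuming them, so group(0) is just the real pattern's text.
lookbehind = re.search(r"(?<=.{%d})abc" % index, s)
print(lookbehind.group(0))  # abc
print(lookbehind.start())   # 6
```

Python's re only accepts fixed-width lookbehinds, which .{3} satisfies. Also note that "." does not match a newline unless you pass re.DOTALL, so either version can misbehave if the skipped text spans lines.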