Given a string, and a character offset within that string, can I search backwards using a Python regular expression?
The actual problem I'm trying to solve is to get a matching phrase at a particular offset within a string, but I have to match the first instance before that offset.
In a situation where I have a regex that's one symbol long (e.g., a word boundary), I'm using a solution where I reverse the string.
import re

my_string = "Thanks for looking at my question, StackOverflow."
offset = 30
boundary = re.compile(r'\b')
end = boundary.search(my_string, offset)
end_boundary = end.start()
end_boundary
Output: 33
end = boundary.search(my_string[::-1], len(my_string) - offset - 1)
start_boundary = len(my_string) - end.start()
start_boundary
Output: 25
my_string[start_boundary:end_boundary]
Output: 'question'
However, this "reverse" technique won't work if I have a more complicated regular expression that may involve multiple characters. For example, if I wanted to match the first instance of "ing" that appears before a specified offset:
my_new_string = "Looking feeding dancing prancing"
offset = 16 # on the word dancing
m = re.match(r'(.*?ing)', my_new_string) # Except looking backwards
Ideal output: feeding
I can likely use other approaches (split the file up into lines, and iterate through the lines backwards), but using a regular expression backwards seems like a conceptually simpler solution.
Use a positive lookbehind to make sure the matched word ends at least 30 characters into the string:
import re

# With offset = 30 the pattern is r'.*?(\w+)(?<=.{30})'
m = re.match(r'.*?(\w+)(?<=.{%d})' % offset, my_string)
if m:
    print(m.group(1))
else:
    print("no match")
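As a rough sketch of how this plays out against the question's first example, the pattern can be wrapped in a small function that also reports where the word starts and ends. The name word_at_offset is a hypothetical helper made up for illustration, and the sketch assumes single-line text, since `.` does not match newlines by default.

```python
import re

def word_at_offset(text, offset):
    """Hypothetical helper: return (word, start, end) for the first word
    that ends at or after `offset`, or None if there is no such word."""
    # Lazy prefix, then a word, then a lookbehind that requires at least
    # `offset` characters before the end of that word.
    m = re.match(r'.*?(\w+)(?<=.{%d})' % offset, text)
    if m is None:
        return None
    return m.group(1), m.start(1), m.end(1)

my_string = "Thanks for looking at my question, StackOverflow."
print(word_at_offset(my_string, 30))  # ('question', 25, 33)
```

The start and end values agree with the start_boundary and end_boundary that the string-reversal approach in the question produces.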
For the other example, a negative lookbehind may help:
my_new_string = "Looking feeding dancing prancing"
offset = 16
m = re.match(r'.*(\b\w+ing)(?<!.{%d})' % offset, my_new_string)
if m: print(m.group(1))  # feeding
which first greedily matches any characters, then backtracks until the captured word ends fewer than 16 characters into the string, which is the point where the negative lookbehind (?<!.{16}) succeeds.
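To see how the offset changes the result, here is a minimal sketch that wraps the same pattern in a function. The name last_ing_word_before is hypothetical, chosen only for illustration, and the sketch assumes single-line text and an offset of at least 1. One thing it makes visible is that the negative lookbehind only accepts a word that ends strictly before the offset.

```python
import re

def last_ing_word_before(text, offset):
    """Hypothetical helper: last word ending in 'ing' that finishes
    strictly before `offset`, or None if there is none."""
    # Greedy .* starts from the end of the string and backtracks; the
    # negative lookbehind rejects any match that ends at position
    # `offset` or later.
    m = re.match(r'.*(\b\w+ing)(?<!.{%d})' % offset, text)
    return m.group(1) if m else None

my_new_string = "Looking feeding dancing prancing"
print(last_ing_word_before(my_new_string, 16))  # feeding
print(last_ing_word_before(my_new_string, 24))  # dancing
print(last_ing_word_before(my_new_string, 7))   # None ("Looking" ends at 7)
```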
We can make use of the greediness of Python's regex engine: a greedy quantifier tries the longest match first and backtracks only when the rest of the pattern fails. So we tell it that we want as many characters as possible, but no more than 30, followed by a word boundary.
An appropriate regex, then, is r'^.{0,30}(\b)'. We want the start of the first capture group.
>>> import re
>>> boundary = re.compile(r'^.{0,30}(\b)')
>>> boundary.search("hello, world; goodbye, world; I am not a pie").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am not").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am").start(1)
30
>>> boundary.search("hello, world; goodbye, pie").start(1)
26
>>> boundary.search("hello, world; pie").start(1)
17
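Tying this back to the question's first example, the backward boundary found this way can be combined with the forward \b search the question already uses. The following is only a sketch under assumptions: word_span is a hypothetical helper name, the text is taken to be single-line, and the offset is assumed to fall strictly inside a word (if it sits exactly on a boundary, both searches stop there and the span is empty).

```python
import re

WORD_BOUNDARY = re.compile(r'\b')

def word_span(text, offset):
    """Hypothetical helper: (start, end) of the word around `offset`,
    or None if no word boundary is found on either side."""
    # Greedy .{0,offset} consumes up to `offset` characters, then
    # backtracks to the nearest word boundary at or before the offset.
    back = re.search(r'^.{0,%d}(\b)' % offset, text)
    # Forward search from the offset, as in the question.
    fwd = WORD_BOUNDARY.search(text, offset)
    if back is None or fwd is None:
        return None
    return back.start(1), fwd.start()

my_string = "Thanks for looking at my question, StackOverflow."
start, end = word_span(my_string, 30)
print(start, end, my_string[start:end])  # 25 33 question
```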