
Can you search backwards from an offset using a Python regular expression?

Given a string, and a character offset within that string, can I search backwards using a Python regular expression?

The actual problem I'm trying to solve is to get a matching phrase at a particular offset within a string, but I have to match the first instance before that offset.

When the regex is only one symbol long (e.g. a word boundary), I can work around this by reversing the string:

import re

my_string = "Thanks for looking at my question, StackOverflow."
offset = 30
boundary = re.compile(r'\b')
end = boundary.search(my_string, offset)
end_boundary = end.start()
end_boundary

Output: 33

end = boundary.search(my_string[::-1], len(my_string) - offset - 1)
start_boundary = len(my_string) - end.start()
start_boundary

Output: 25

my_string[start_boundary:end_boundary]

Output: 'question'

However, this "reverse" technique won't work if I have a more complicated regular expression that may involve multiple characters. For example, if I wanted to match the first instance of "ing" that appears before a specified offset:

my_new_string = "Looking feeding dancing prancing"
offset = 16 # on the word dancing
m = re.match(r'(.*?ing)', my_new_string) # Except looking backwards

Ideal output: feeding

I can likely use other approaches (split the file up into lines, and iterate through the lines backwards) but using a regular expression backwards seems like a conceptually-simpler solution.

Irwin asked Jun 20 '13 00:06


2 Answers

Use a positive lookbehind to make sure there are at least 30 characters before the end of the matched word:

# With offset = 30 the pattern is r'.*?(\w+)(?<=.{30})'
m = re.match(r'.*?(\w+)(?<=.{%d})' % offset, my_string)
if m:
    print(m.group(1))
else:
    print("no match")
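As a minimal sketch, the same pattern can be wrapped in a small helper. The name word_at_offset is made up for illustration, and the re.DOTALL flag is my addition so that "." also counts newlines; the helper assumes offset is at least 1.

```python
import re

def word_at_offset(text, offset):
    """Hypothetical helper: first word whose end lies at or beyond `offset`."""
    # .*? lazily skips ahead, (\w+) grabs a word, and the lookbehind only
    # succeeds once at least `offset` characters precede the end of that word.
    pattern = r'.*?(\w+)(?<=.{%d})' % offset
    m = re.match(pattern, text, re.DOTALL)  # DOTALL is an added assumption
    return m.group(1) if m else None

my_string = "Thanks for looking at my question, StackOverflow."
print(word_at_offset(my_string, 30))  # question
```

If the offset is larger than the string, the lookbehind can never succeed and the helper returns None.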

For the other example, a negative lookbehind may help:

my_new_string = "Looking feeding dancing prancing"
offset = 16
m = re.match(r'.*(\b\w+ing)(?<!.{%d})' % offset, my_new_string)
if m:
    print(m.group(1))

The leading .* first matches greedily, then backtracks until the match ends at a position with fewer than 16 characters before it, which is where the negative lookbehind (?<!.{16}) finally succeeds.
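To see the backtracking at work, here is a rough sketch that runs the same pattern with a few different offsets. The function name last_ing_word_before is invented for this example, and re.DOTALL is an added assumption so that newlines are counted too.

```python
import re

def last_ing_word_before(text, offset):
    """Hypothetical helper: last word ending in 'ing' that ends before `offset`."""
    # Greedy .* starts from the far right and backtracks leftwards; the
    # negative lookbehind rejects any match whose end has `offset` or more
    # characters in front of it.
    pattern = r'.*(\b\w+ing)(?<!.{%d})' % offset
    m = re.match(pattern, text, re.DOTALL)  # DOTALL is an added assumption
    return m.group(1) if m else None

my_new_string = "Looking feeding dancing prancing"
for offset in (8, 16, 24):
    print(offset, last_ing_word_before(my_new_string, offset))
# 8 Looking
# 16 feeding
# 24 dancing
```

Note that a match is accepted only if it ends strictly before the offset: with offset 15 (the space right after "feeding") the result is "Looking", so pass offset + 1 if a match ending exactly at the offset should count.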

perreal answered Sep 28 '22 17:09


We can make use of the Python regex engine's preference for greedy matching (sort of, not really): ask for as many characters as possible, but no more than 30, followed by the thing we are looking for.

An appropriate regex, then, is r'^.{0,30}(\b)'. The position we want is the start of the first capture group.

>>> boundary = re.compile(r'^.{0,30}(\b)')
>>> boundary.search("hello, world; goodbye, world; I am not a pie").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am not").start(1)
30
>>> boundary.search("hello, world; goodbye, world; I am").start(1)
30
>>> boundary.search("hello, world; goodbye, pie").start(1)
26
>>> boundary.search("hello, world; pie").start(1)
17
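A possible way to combine this with the forward search from the question, as a sketch only: boundary_at_or_before is a made-up name, re.DOTALL is my addition, and the example assumes there is a word boundary at or after the offset (otherwise the forward search returns None).

```python
import re

def boundary_at_or_before(text, limit):
    """Hypothetical helper: position of the last word boundary at or before `limit`."""
    # ^.{0,limit} greedily consumes up to `limit` characters, then backs off
    # one character at a time until the empty group (\b) sits on a boundary.
    m = re.search(r'^.{0,%d}(\b)' % limit, text, re.DOTALL)  # DOTALL is an added assumption
    return m.start(1) if m else None

my_string = "Thanks for looking at my question, StackOverflow."
offset = 30
start = boundary_at_or_before(my_string, offset)            # search backwards
end = re.compile(r'\b').search(my_string, offset).start()   # search forwards
print(start, end, my_string[start:end])  # 25 33 question
```

When the offset itself sits on a boundary, both searches return the offset and the slice is empty.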
muhmuhten answered Sep 28 '22 15:09