Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I find the shortest overlapping match using regular expressions?

Tags:

python

regex

I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a substring of a larger match. For example:

import re
string = "A|B|A|B|C|D|E|F|G"
my_pattern = 'a.*?b.*?c'

my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
matches = my_regex.findall(string)

for match in matches:
    print match

prints:

A|B|A|B|C

but I'd want it to return:

A|B|C

Is there a way to do this without having to loop over each match to see if it contains a substring that matches?

like image 439
ryan Avatar asked Jan 27 '10 16:01

ryan


People also ask

How do you match a regular expression?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).

What does the regular expression '[ a za z ]' match?

For example, the regular expression "[ A-Za-z] " specifies to match any single uppercase or lowercase letter. In the character set, a hyphen indicates a range of characters, for example [A-Z] will match any one capital letter.

Is regex matching fast?

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow.


2 Answers

Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group:

>>> my_pattern = '(?=(a.*?b.*?c))'
>>> my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
>>> matches = my_regex.findall(string)
>>> print min(matches, key=len)
A|B|C

findall() will return all possible matches, so you need min() to get the shortest one.

How this works:

  • We're not matching any text in this regex, just positions in the string (which the regex engine steps through during a match attempt).
  • At each position, the regex engine looks ahead to see whether your regex would match at this position.
  • If so, it will be captured by the capturing group.
  • If not, it won't.
  • In either case, the regex engine then steps ahead one character and repeats the process until the end of the string.
  • Since the lookahead assertion doesn't consume any characters, all overlapping matches will be found.
like image 162
Tim Pietzcker Avatar answered Oct 16 '22 03:10

Tim Pietzcker


No. Perl returns the longest, leftmost match, while obeying your non-greedy quantifiers. You'll have to loop, I'm afraid.

Edit: Yes, I realize I said Perl above, but I believe it is true for Python.

like image 45
Paul Beckingham Avatar answered Oct 16 '22 01:10

Paul Beckingham