Is it possible to make re find the smallest match while using greedy characters [duplicate]

Tags: python, regex

Disclaimer: I'm not a regex expert.

I'm using Python's re module to perform regex matching on many HTML files. One of the patterns is something like this:

<bla><blabla>87765.*</blabla><bla>

The problem I've encountered is that instead of finding all (say) five occurrences of the pattern, it finds only one. It welds all the occurrences into a single match, using the <bla><blabla>87765 part of the first occurrence and the </blabla><bla> part of the last occurrence in the page.

Is there any way to tell re to find the smallest match?

Dave Berk asked Sep 15 '09 18:09


1 Answer

You can use a reluctant (non-greedy) quantifier in your pattern, which matches as few characters as possible. For more details, see the Python documentation on the *?, +?, and ?? operators:

<bla><blabla>87765.*?</blabla><bla>
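To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy patterns with re.findall. The sample string `page` is made up for illustration and is not from the original question; real pages will differ.

```python
import re

# Hypothetical sample input: three occurrences of the pattern on one line.
page = (
    "<bla><blabla>87765 one</blabla><bla>"
    "<bla><blabla>87765 two</blabla><bla>"
    "<bla><blabla>87765 three</blabla><bla>"
)

# Greedy: .* runs to the end of the line, then backtracks to the LAST closing part.
greedy = re.findall(r"<bla><blabla>87765.*</blabla><bla>", page)

# Non-greedy: .*? stops at the FIRST closing part after each opening part.
lazy = re.findall(r"<bla><blabla>87765.*?</blabla><bla>", page)

print(len(greedy))  # 1
print(len(lazy))    # 3
```

Note that by default `.` does not match a newline, so if a single occurrence can span several lines, passing `re.DOTALL` to the call is probably needed as well.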

Alternatively, exclude < from the characters that can be matched:

<bla><blabla>87765[^<]*</blabla><bla>

Note that this second pattern works only if there are no child tags between <blabla> and </blabla>.
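The following sketch illustrates both the negated character class and its limitation. The strings `flat` and `nested` are invented test inputs, and the `<i>` child tag is just an assumed example of nested markup.

```python
import re

# [^<]* matches any run of characters that contains no '<'.
pattern = re.compile(r"<bla><blabla>87765[^<]*</blabla><bla>")

# Hypothetical input with two occurrences and no child tags.
flat = (
    "<bla><blabla>87765 a</blabla><bla>"
    "<bla><blabla>87765 b</blabla><bla>"
)

# Hypothetical input where a child tag sits between <blabla> and </blabla>.
nested = "<bla><blabla>87765 <i>a</i></blabla><bla>"

flat_matches = pattern.findall(flat)
nested_matches = pattern.findall(nested)

print(len(flat_matches))    # 2
print(len(nested_matches))  # 0, the '<' of <i> stops [^<]* before </blabla>
```

If child tags can appear, the non-greedy .*? form is likely the safer choice of the two. Unlike `.`, the class `[^<]` also matches newlines, so this pattern handles multi-line content without `re.DOTALL`.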

iammichael answered Sep 28 '22 06:09
