
Can I have a non-greedy regex with dotall?

I would like to match dotall and non-greedy. This is what I have:

img(.*?)(onmouseover)+?(.*?)a

However, this is not being non-greedy. This data is not matching as I expected:

<img src="icon_siteItem.gif" alt="siteItem" title="A version of this resource is available on siteItem" border="0"></a><br><br></td><td rowspan="4" width="20"></td></tr><tr><td>An activity in which students find other more specific adjectives to 
describe a range of nouns, followed by writing a postcard to describe a 
nice holiday without using the word 'nice'.</td></tr><tr><td>From the resource collection: <a href="http://www.siteItem.co.uk/index.asp?CurrMenu=searchresults&amp;tag=326" title="Resources to help work">Drafting </a></td></tr><tr><td><abbr style="border-bottom:0px" title="Key Stage 3">thing</abbr> | <abbr style="border-bottom:0px" title="Key Stage 4">hello</abbr> | <abbr style="border-bottom:0px" title="Resources">Skills</abbr></td></tr></tbody></table></div></div></td></tr><tr><td><div style="padding-left: 30px"><div><table style="" bgcolor="#DFE7EE" border="0" cellpadding="0" cellspacing="5" width="100%"><tbody><tr valign="top"><td rowspan="4" width="60"><a href="javascript:requiresLevel0(350,350);"><img name="/attachments/3700.pdf" onmouseover="ChangeImageOnRollover(this,'/application/files/images/attach_icons/rollover_pdf.gif')" onmouseout="ChangeImageOnRollover(this,'/application/files/images/attach_icons/small_pdf.gif')" src="small_pdf.gif" alt="Download Recognising and avoiding ambiguity in PDF format" title="Download in PDF format" style="vertical-align: middle;" border="0"></a><br>790.0 k<br>

and I cannot understand why.

What I think I am stating in the above regex is:

Start with "img", then allow zero or more of any character (including newlines), then look for at least one "onmouseover", then allow zero or more of any character (including newlines), then an "a".

Why doesn't this work as I expected?

KEY POINT: dotall must be enabled

asked Feb 29 '12 22:02

Django Doctor


2 Answers

It is being non-greedy. It is your understanding of non-greedy that is not correct.

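A regex will always try to match, starting from the leftmost position where a match is possible. A non-greedy quantifier only makes each quantified part consume as few characters as it can while still letting the overall match succeed; it does not make the engine skip ahead to a later "img" to get a shorter match.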

Let me show a simplified example of what non-greedy actually means (as suggested in a comment):

re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)

This will match:

  • as few repetitions of 'a' as possible (in this case 2)
  • followed by a 'b'
  • and as few repetitions of 'c' as possible (in this case 0)

so the only match is 'aab'.
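The example above can be run as-is. Here is a minimal sketch that also adds a greedy variant for comparison (the greedy pattern is my addition, not part of the original example):

```python
import re

text = 'aabcc'

# Lazy quantifiers: a*? still has to expand to two 'a's, because otherwise
# 'b' could not match at all; c*? is free to stop at zero repetitions.
lazy = re.findall(r'a*?bc*?', text, re.DOTALL)

# Greedy quantifiers, for comparison: c* takes both trailing 'c's.
greedy = re.findall(r'a*bc*', text, re.DOTALL)

print(lazy)    # ['aab']
print(greedy)  # ['aabcc']
```

Note that the lazy `a*?` ends up consuming exactly as much as the greedy `a*` here: laziness never lets the engine move the start of the match forward.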

And just to conclude:

Don't use regex to parse HTML. There are libraries that were made for the job. re is not one of them.
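The answer does not name a particular library, so as an assumption here is a rough sketch using `html.parser` from the standard library. The class name `RolloverImageFinder` and the shortened sample markup are hypothetical, made up for illustration:

```python
from html.parser import HTMLParser


class RolloverImageFinder(HTMLParser):
    """Collects the attributes of <img> tags that have an onmouseover attribute."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        attr_map = dict(attrs)
        if tag == 'img' and 'onmouseover' in attr_map:
            self.matches.append(attr_map)


# Shortened, hypothetical stand-in for the HTML in the question.
html = (
    '<img src="icon_siteItem.gif" border="0"></a>\n'
    '<td>some text</td>\n'
    '<a href="#"><img name="/attachments/3700.pdf" '
    'onmouseover="ChangeImageOnRollover(this)" src="small_pdf.gif"></a>'
)

finder = RolloverImageFinder()
finder.feed(html)
finder.close()

for attr_map in finder.matches:
    print(attr_map.get('name'), attr_map.get('src'))
```

Because the parser hands over one tag at a time, the first `<img>` (which has no `onmouseover`) is simply skipped, and there is no way for a match to span two tags.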

answered Oct 13 '22 15:10

stranac


First of all, your regex looks a little funky: you're saying match "img", then any number of characters, "onmouseover" at least once, but possibly repeated (e.g. "onmouseoveronmouseoveronmouseover"), followed by any number of characters, followed by "a".
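To see what `(onmouseover)+?` does on its own, here is a small sketch; the test strings and the trailing `x` are made up for illustration:

```python
import re

# On its own, the lazy +? is satisfied by a single repetition.
single = re.search(r'(onmouseover)+?', 'onmouseoveronmouseover')
print(single.group(0))    # onmouseover

# When something must follow, +? expands to as many repetitions as needed.
repeated = re.search(r'(onmouseover)+?x', 'onmouseoveronmouseoverx')
print(repeated.group(0))  # onmouseoveronmouseoverx
```

So the group with `+?` is legal, but for this use case a plain `onmouseover` would say the same thing more clearly.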

This should match from img src="icon_ all the way to onmouseover="Cha. That's probably not what you want, but it's what you asked for.
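This can be checked with a quick sketch. The sample below is a shortened, hypothetical stand-in for the HTML in the question; the pattern is the one from the question, compiled with `re.DOTALL`:

```python
import re

# Shortened, hypothetical stand-in for the HTML in the question.
html = (
    '<img src="icon_siteItem.gif" border="0"></a>\n'
    '<td>some text</td>\n'
    '<a href="#"><img name="/attachments/3700.pdf" '
    'onmouseover="ChangeImageOnRollover(this)" src="small_pdf.gif"></a>'
)

pattern = re.compile(r'img(.*?)(onmouseover)+?(.*?)a', re.DOTALL)
m = pattern.search(html)

# The match starts at the FIRST "img" (index 1, right after the '<').
print(m.start())                                 # 1

# Group 1 had to stretch across the newlines and into the second tag.
print('<img name=' in m.group(1))                # True

# The last lazy group stops at the first 'a' after "onmouseover".
print(m.group(0).endswith('onmouseover="Cha'))   # True
```

The first `(.*?)` is lazy, but it still has to grow until "onmouseover" can match, and the nearest one is inside the second tag.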

Second, and this is significantly more important:

DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.

And in case you didn't understand it the first time, let me repeat it in italics:

DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.

Finally, let me link you to the canonical grimoire on the subject:

You can't parse [X]HTML with a regex

answered Oct 13 '22 17:10

tylerl