
Regex lookahead with multiple negative conditions

I am running a regex against an HTML string to extract URLs. I want to fetch all hrefs and srcs that do not point to JavaScript files. From another SO post I have the following pattern:

/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*"/

Which fetches me results like:

src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0"

This is good because the .js results are missing. It is bad because the match runs past the closing quote and picks up the element's other attributes. I tried the following amendment to stop at the first ":

/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)[^"]*"/

It works in that it returns only href="$url", but it also returns results ending in .js. Is there a way to combine a negative lookahead that says:

  • Match string until it comes across another " - i.e. [^"]*; and
  • Do not match string if it ends in .js"

Thanks in advance for any help/tips/pointers.

james asked Oct 17 '25 22:10


1 Answer

Add a "?" after the "*" before the last quote. This makes the "*" non-greedy, i.e. it stops matching at the first quote rather than the last:

/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*?"/
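
As a rough check, here is a minimal Python sketch of the lazy pattern, run against some made-up markup. The sample HTML and the variable names are assumptions for illustration, and the prefix is simplified to a non-capturing (?:href|src) so that findall returns whole matches. One caveat: the lookahead is tested at every character, so this pattern rejects any URL that contains ".js" anywhere, including ".json". If only URLs that actually end in .js should be skipped, one possible variant is to keep [^"]* and put a negative lookbehind just before the closing quote.

```python
import re

# Made-up sample markup, for illustration only.
html = (
    '<a href="http://www.mydomain.com/page.html" title="Home">Home</a>\n'
    '<img src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0">\n'
    '<script src="http://www.mydomain.com/js/app.js" type="text/javascript"></script>\n'
    '<a href="http://www.mydomain.com/data.json" rel="nofollow">Data</a>\n'
)

# Shared prefix; simplified from the original (href|src)?\= for this sketch.
PREFIX = r'(?:href|src)="http://www\.mydomain\.com/'

# Pattern from the answer: the lazy quantifier stops at the first quote,
# and the lookahead refuses to step onto any ".js".
lazy = re.compile(PREFIX + r'(?:(?!\.js).)*?"')

# Variant (an assumption, not from the answer): stay inside the quotes
# with [^"]*, then require that the text right before the closing quote
# is not ".js".
strict = re.compile(PREFIX + r'[^"]*(?<!\.js)"')

lazy_matches = lazy.findall(html)
strict_matches = strict.findall(html)

for label, matches in (("lazy", lazy_matches), ("strict", strict_matches)):
    print(label)
    for m in matches:
        print("  ", m)
```

With this sample, both patterns skip app.js and neither runs on into alt="" or title="". The difference is data.json: the lazy pattern drops it, the lookbehind variant keeps it.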

HammerNL answered Oct 19 '25 14:10


