I am running a regex against an HTML string to extract URLs. I want every href and src value that does not point to a JavaScript file. From another SO post I have the following pattern:
/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*"/
This returns results like:
src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0"
This is good because the .js results are excluded. It's bad because it also captures the other attributes in the element. I tried the following amendment to stop at the first ":
/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)[^"]*"/
It works in that it returns href="$url", but it now also returns results ending in .js. Is there a way to combine the negative lookahead with the character class, so that the match stops at the first " (i.e. [^"]*) and is rejected when the URL ends in .js"?
Thanks in advance for any help/tips/pointers.
Add a "?" after the "*" before the last quote. This makes the "*" non-greedy, i.e. it stops matching at the first closing quote rather than the last:
/(href|src)?\="http:\/\/www\.mydomain\.com\/(?:(?!\.js).)*?"/
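For anyone who wants to try this out, here is a minimal sketch in Python (the question does not say which regex engine is in use, so Python's re module is an assumption). The sample markup, the data.json link and the variable names are made up for illustration. The optional capturing group (href|src)? is swapped for a required non-capturing (?:href|src) so that findall returns whole matches. Besides the lazy pattern above, it shows two variants that use [^"] as the question asked: one that checks the lookahead at every character, and one that only rejects URLs ending in .js".

```python
import re

# Hypothetical sample markup, invented for this illustration.
html = "\n".join([
    '<img src="http://www.mydomain.com/path/to/resource/image.gif" alt="" border="0">',
    '<script src="http://www.mydomain.com/js/app.js" type="text/javascript"></script>',
    '<a href="http://www.mydomain.com/about.html" title="About">About</a>',
    '<a href="http://www.mydomain.com/data.json">Data</a>',
])

# Shared prefix: attribute name, opening quote, and the domain.
PREFIX = r'(?:href|src)="http://www\.mydomain\.com/'

# Lazy version from the answer: *? stops at the first closing quote.
lazy = re.compile(PREFIX + r'(?:(?!\.js).)*?"')

# Same idea with [^"]: each consumed character must not be a quote
# and must not be the start of ".js".
no_quote = re.compile(PREFIX + r'(?:(?!\.js)[^"])*"')

# Stricter reading of the question: reject only URLs that END in .js,
# by looking ahead once for .js immediately before the closing quote.
ends_only = re.compile(PREFIX + r'(?![^"]*\.js")[^"]*"')

lazy_hits = lazy.findall(html)
no_quote_hits = no_quote.findall(html)
ends_only_hits = ends_only.findall(html)

for name, hits in [("lazy", lazy_hits), ("no_quote", no_quote_hits), ("ends_only", ends_only_hits)]:
    print(name)
    for hit in hits:
        print("  ", hit)
```

On this sample the first two patterns return the image.gif and about.html attributes only. They skip app.js, but they also skip data.json, because (?!\.js) rejects ".js" anywhere in the URL. The third pattern keeps data.json and still skips app.js. For anything beyond a quick extraction, an HTML parser is usually a safer choice than a regex.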