There are many regexes out there to match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (href, inner value, etc.). So NONE of the URLs in these should match:
<a href="http://www.example.com/">something</a>
<a href="http://www.example.com/">http://www.example2.com</a>
<a href="http://www.example.com/"><b>something</b>http://www.example.com/<span>test</span></a>
Any URL outside of <a></a> should be matched.
One approach I tried was to use a negative lookahead to check whether the first <a> tag after the URL is an opening <a> or a closing </a>. If it is a closing </a>, then the URL must be inside a hyperlink. I think this idea was okay, but my negative lookahead didn't work (more accurately, I didn't write the regex correctly). Any tips are much appreciated.
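A minimal sketch of that lookahead idea in JavaScript, under a few assumptions: the URL pattern is deliberately simple (http:// or https:// followed by characters that are not whitespace, quotes, or angle brackets), anchors are well formed and not nested, and the name findUrlsOutsideAnchors is made up for illustration. The lookahead rejects a URL whenever a closing </a> can be reached without first passing another opening <a.

```javascript
// Hypothetical helper: returns URLs that are not inside an <a>...</a> element.
function findUrlsOutsideAnchors(html) {
  // https?:\/\/[^\s<>"']+      a simplified URL (assumption, not a full URL grammar)
  // (?!(?:(?!<a\b)[\s\S])*?<\/a>)
  //   scan forward without crossing an opening "<a"; if "</a>" is reached
  //   first, the URL sits inside an anchor (attribute or inner text) and is rejected
  const pattern = /https?:\/\/[^\s<>"']+(?!(?:(?!<a\b)[\s\S])*?<\/a>)/gi;
  return html.match(pattern) || [];
}

const sample =
  'see http://out.example/x and ' +
  '<a href="http://www.example.com/"><b>s</b>http://www.example.com/<span>t</span></a>' +
  ' then https://after.example';

console.log(findUrlsOutsideAnchors(sample));
// → [ 'http://out.example/x', 'https://after.example' ]
```

The forward scan runs once per candidate URL, so this can get slow on very large inputs, and malformed HTML (an unclosed <a>, for example) will give wrong answers.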
In Java, this can be done with Pattern.matcher(). For each match, take the substring from the start index of the match to its end index and add it to a list. Once all matches have been processed, if the list is empty, print "-1", since the string S contains no URL.
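The same collect-then-check flow, sketched in JavaScript to stay consistent with the other code in this thread rather than Java. The names extractUrls and report are hypothetical, and the URL pattern is a simplified assumption.

```javascript
// Hypothetical helper: collect every URL-looking substring of s into a list.
function extractUrls(s) {
  const urlPattern = /https?:\/\/[^\s<>"']+/gi; // simplified URL pattern (assumption)
  const found = [];
  for (const m of s.matchAll(urlPattern)) {
    // Substring from the first index of the match to one past its last index.
    found.push(s.substring(m.index, m.index + m[0].length));
  }
  return found;
}

// Returns "-1" when no URL is present, otherwise the URLs one per line.
function report(s) {
  const urls = extractUrls(s);
  return urls.length === 0 ? "-1" : urls.join("\n");
}

console.log(report("docs at http://a.example/x and https://b.example"));
console.log(report("no links here")); // → -1
```

Note that this only extracts URLs; it does not skip the ones inside <a> tags.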
I was looking for this answer as well, and because nothing out there really worked the way I wanted it to, this is the regex I created. Since it's a regex, be aware that this is not a perfect solution.
/(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www\.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi
And the whole function to update the HTML is:
function linkifyWithRegex(input) {
  // The trailing lookahead (?![^<]*<\/a>) skips URLs that are directly followed by a closing </a>.
  // "www\." has its dot escaped so that it matches a literal "www." only.
  const regx = /(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www\.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi;
  // Wrap every remaining match in an anchor that points at itself.
  return input.replace(regx, function (match) {
    return '<a href="' + match + '">' + match + "</a>";
  });
}
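An alternative that avoids lookaheads altogether is to let the regex also match the things that should be skipped (whole anchors and any other tag) and only rewrite a match when the URL capture group took part. This is a sketch rather than a drop-in replacement: linkifyOutsideAnchors is a made-up name, the URL pattern is a simplified assumption, and anchors are assumed not to be nested.

```javascript
// Hypothetical alternative to linkifyWithRegex: skip anchors and tags, wrap bare URLs.
function linkifyOutsideAnchors(input) {
  // Alternative 1: an entire <a ...>...</a> element (left untouched)
  // Alternative 2: any other tag, so URLs in attributes such as src are left alone
  // Alternative 3: a bare URL, captured in group 1
  const pattern = /<a\b[^>]*>[\s\S]*?<\/a>|<[^>]+>|(https?:\/\/[^\s<>"']+)/gi;
  return input.replace(pattern, function (match, url) {
    // url is undefined when alternative 1 or 2 matched.
    return url ? '<a href="' + url + '">' + url + "</a>" : match;
  });
}

const before =
  'Go to http://x.example/a or ' +
  '<a href="http://y.example/">http://y.example/</a> ' +
  '<img src="http://z.example/i.png">';

console.log(linkifyOutsideAnchors(before));
// → Go to <a href="http://x.example/a">http://x.example/a</a> or <a href="http://y.example/">http://y.example/</a> <img src="http://z.example/i.png">
```

Because the anchor alternative is tried first at every position, anything inside an existing link is consumed before the URL alternative ever sees it.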
You can do it in two steps instead of trying to come up with a single regular expression:
1. Strip out (replace with nothing) the HTML anchors: the entire element, meaning the opening tag, the content, and the closing tag.
2. Match the URL in what remains.
In Perl it could be:
my $curLine = $_; # Work on a copy so that $_ stays intact if it is needed for something else.
# Remove every HTML anchor: "<a ...>", "</a>", and everything in between.
# The lazy .*? with /s also covers anchors that contain nested tags or line breaks.
$curLine =~ s/<a\b[^>]*>.*?<\/a>//gis;
if ( $curLine =~ /http:\/\// )
{
    print "Matched a URL outside an HTML anchor: $_\n";
}
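The same two-step idea, sketched in JavaScript for comparison with the function earlier in the thread. The helper name urlsAfterStrippingAnchors and the simplified URL pattern are assumptions. Anchors are replaced with a space rather than an empty string so that text on either side of a removed anchor is not glued together into something that looks like one URL.

```javascript
// Hypothetical helper: step 1 strips anchors, step 2 matches URLs in the rest.
function urlsAfterStrippingAnchors(html) {
  // Step 1: remove each <a ...>...</a> element, including nested tags inside it.
  const withoutAnchors = html.replace(/<a\b[^>]*>[\s\S]*?<\/a>/gi, " ");
  // Step 2: match URLs in the remaining text (simplified pattern, an assumption).
  return withoutAnchors.match(/https?:\/\/[^\s<>"']+/gi) || [];
}

console.log(
  urlsAfterStrippingAnchors(
    'x http://a.example <a href="http://b.example/">http://c.example</a> y'
  )
);
// → [ 'http://a.example' ]
```

This tells you which URLs sit outside anchors, but since it works on a stripped copy, the match positions no longer line up with the original string.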