I'm trying to match strings that look like this:
http://www.google.com
But not if it occurs in larger context like this:
<a href="http://www.google.com"> http://www.google.com </a>
The regex I've got that does the job in a couple different RegEx engines I've tested (PHP, ActionScript) looks like this:
(?<!["'>]\b*)((https?://)([A-Za-z0-9_=%&@?./-]+))\b
You can see it working here: http://regexr.com?36g0e
The problem is that that particular RegEx doesn't seem to work correctly under .NET.
private static readonly Regex fixHttp = new Regex(@"(?<![""'>]\b*)((https?://)([A-Za-z0-9_=%&@?./-]+))\b", RegexOptions.IgnoreCase);
private static readonly Regex fixWww = new Regex(@"(?<=[\s])\b((www\.)([A-Za-z0-9_=%&@?./-]+))\b", RegexOptions.IgnoreCase);
public static string FixUrls(this string s)
{
    s = fixHttp.Replace(s, "<a href=\"$1\">$1</a>");
    s = fixWww.Replace(s, "<a href=\"http://$1\">$1</a>");
    return s;
}
Specifically, .NET doesn't seem to be paying attention to the first \b*. In other words, it correctly fails to match this string:
<a href="http://www.google.com">http://www.google.com</a>
But it incorrectly matches this string (note the extra spaces):
<a href="http://www.google.com"> http://www.google.com </a>
Any ideas as to what I'm doing wrong or how to work around it?
I was waiting for one of the folks who actually originally answered this question to pop the answer down here, but since they haven't, I'll throw it in.
I'm not precisely sure what was going wrong, but it turns out that in .NET, I needed to replace the \b* with a \s*. The \s* doesn't seem to work with other RegEx engines (I only did a little bit of testing), but it does work correctly with .NET. The documentation I've read around \b would lead me to believe that it should match whitespace leading up to a word as well, but perhaps I've misunderstood, or perhaps there are some weirdnesses around captures that different engines handle differently.
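One possible explanation, offered as a sketch rather than a definitive diagnosis: \b is a zero-width assertion, so it matches a position between a word character and a non-word character and never consumes whitespace. Repeating it with * still consumes nothing, so the lookbehind only ever looks at the single character immediately before the URL. The small program below compares the two lookbehinds on the spaced string from the question. The class name BoundaryDemo, the helper CountMatches, and the simplified patterns without capture groups are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // The spaced example from the question.
    public const string Spaced =
        "<a href=\"http://www.google.com\"> http://www.google.com </a>";

    // \b* is zero-width, so the lookbehind inspects only the one character before "http".
    public const string WithBoundary = @"(?<![""'>]\b*)https?://[A-Za-z0-9_=%&@?./-]+\b";

    // \s* lets the lookbehind step back over whitespace before checking for a quote or '>'.
    public const string WithWhitespace = @"(?<![""'>]\s*)https?://[A-Za-z0-9_=%&@?./-]+\b";

    // Hypothetical helper: counts how many times the pattern matches the input.
    public static int CountMatches(string pattern, string input)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountMatches(WithBoundary, Spaced));   // prints 1: the URL after "> " still matches
        Console.WriteLine(CountMatches(WithWhitespace, Spaced)); // prints 0: both URLs are rejected
    }
}
```

With \b*, the character before the second URL is a space, which is not in the set ["'>], so the negative lookbehind lets the match through. With \s*, the lookbehind can skip the space and find the > behind it.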
At any rate, this is my final RegEx:
(?<!["'>]\s*)((https?:\/\/)([A-Za-z0-9_=%&@\?\.\/\-]+))\b
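For anyone who wants to try it, here is a minimal sketch of that final pattern used in a cut-down version of the FixUrls helper from the question. The class name UrlFixer and the sample inputs are made up for illustration, and the www handling is left out. Note that the pattern relies on an unbounded \s* inside a lookbehind, which .NET supports but many other engines reject or restrict. That may be why it did not carry over to the other engines tested.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlFixer
{
    // Final pattern from the answer, as a C# verbatim string ("" is an escaped quote).
    private static readonly Regex FixHttp = new Regex(
        @"(?<![""'>]\s*)((https?:\/\/)([A-Za-z0-9_=%&@\?\.\/\-]+))\b",
        RegexOptions.IgnoreCase);

    // Wraps bare http/https URLs in anchor tags; URLs already inside a tag are left alone.
    public static string FixUrls(string s)
    {
        return FixHttp.Replace(s, "<a href=\"$1\">$1</a>");
    }

    public static void Main()
    {
        // A bare URL is wrapped in a link.
        Console.WriteLine(FixUrls("Visit http://www.google.com today"));
        // An existing link, even with extra spaces inside it, comes back unchanged.
        Console.WriteLine(FixUrls("<a href=\"http://www.google.com\"> http://www.google.com </a>"));
    }
}
```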
I don't understand what was going wrong well enough to give any real context for why this change works, and I dislike RegExes enough that I can't quite justify the time figuring it out, but maybe it'll help someone else eventually :-).