
RegEx doesn't work with .NET, but does with other RegEx implementations

Tags: html, c#, .net, regex

I'm trying to match strings that look like this:

http://www.google.com

But not if it occurs in a larger context like this:

<a href="http://www.google.com"> http://www.google.com </a>

The regex I've got, which does the job in a couple of different RegEx engines I've tested (PHP, ActionScript), looks like this:

(?<!["'>]\b*)((https?://)([A-Za-z0-9_=%&@?./-]+))\b

You can see it working here: http://regexr.com?36g0e

The problem is that this particular RegEx doesn't seem to work correctly under .NET:

private static readonly Regex fixHttp = new Regex(@"(?<![""'>]\b*)((https?://)([A-Za-z0-9_=%&@?./-]+))\b", RegexOptions.IgnoreCase);
private static readonly Regex fixWww = new Regex(@"(?<=[\s])\b((www\.)([A-Za-z0-9_=%&@?./-]+))\b", RegexOptions.IgnoreCase);

public static string FixUrls(this string s)
{
    s = fixHttp.Replace(s, "<a href=\"$1\">$1</a>");
    s = fixWww.Replace(s, "<a href=\"http://$1\">$1</a>");
    return s;
}

Specifically, .NET doesn't seem to be paying attention to the first \b*. In other words, it correctly fails to match this string:

<a href="http://www.google.com">http://www.google.com</a>

But it incorrectly matches this string (note the extra spaces):

<a href="http://www.google.com"> http://www.google.com </a>

Any ideas as to what I'm doing wrong or how to work around it?

Ken Smith asked Sep 26 '13 06:09


1 Answer

I was waiting for one of the folks who originally answered this question to post the answer here, but since they haven't, I'll add it myself.

I'm not sure exactly what was going wrong, but it turns out that in .NET I needed to replace the \b* with \s*. The \s* version doesn't seem to work with other RegEx engines (I only did a little testing), but it does work correctly in .NET. The documentation I've read on \b led me to believe that it should also match whitespace leading up to a word, but perhaps I've misunderstood, or perhaps different engines handle captures in subtly different ways.
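One likely piece of the puzzle, offered as a sketch rather than a verified diagnosis: \b is a zero-width assertion. It matches the position between a word character and a non-word character and never consumes whitespace, so repeating it with * cannot step over the space in "> http". The snippet below illustrates that with simplified patterns (the class and method names are made up for the example, and the patterns are not the ones from the question):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the name and patterns are illustrative only.
public static class WordBoundaryDemo
{
    // \b only asserts a position between '>' and 'h'; it consumes nothing.
    public static bool BoundaryThenHttp(string input) =>
        Regex.IsMatch(input, @">\bhttp");

    // \s* actually consumes any whitespace between '>' and 'h'.
    public static bool WhitespaceThenHttp(string input) =>
        Regex.IsMatch(input, @">\s*http");

    public static void PrintDemo()
    {
        Console.WriteLine(BoundaryThenHttp(">http"));    // True: '>' and 'h' are adjacent
        Console.WriteLine(BoundaryThenHttp("> http"));   // False: \b cannot skip the space
        Console.WriteLine(WhitespaceThenHttp("> http")); // True: \s* matches the space
    }
}
```

This only shows why \b cannot stand in for whitespace in .NET; it does not explain why the original pattern appeared to work in the other engines.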

At any rate, this is my final RegEx:

(?<!["'>]\s*)((https?:\/\/)([A-Za-z0-9_=%&@\?\.\/\-]+))\b
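For completeness, here is roughly how that pattern slots back into the extension method from the question. Treat it as a minimal sketch: the class name UrlLinker is made up for the example, and the www pattern is carried over from the question unchanged. It relies on .NET accepting a variable-length lookbehind such as \s*, which engines like PCRE (used by PHP) have traditionally rejected; that may be why the \s* version failed elsewhere.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container class; extension methods need a static class.
public static class UrlLinker
{
    // Final pattern from the answer: skip URLs preceded by a quote or '>'
    // followed by any amount of whitespace (variable-length lookbehind).
    private static readonly Regex FixHttp = new Regex(
        @"(?<![""'>]\s*)((https?:\/\/)([A-Za-z0-9_=%&@\?\.\/\-]+))\b",
        RegexOptions.IgnoreCase);

    // Carried over from the question: bare "www." hosts preceded by whitespace.
    private static readonly Regex FixWww = new Regex(
        @"(?<=[\s])\b((www\.)([A-Za-z0-9_=%&@?./-]+))\b",
        RegexOptions.IgnoreCase);

    // Wraps bare URLs in anchor tags and leaves already-linked URLs alone.
    public static string FixUrls(this string s)
    {
        s = FixHttp.Replace(s, "<a href=\"$1\">$1</a>");
        s = FixWww.Replace(s, "<a href=\"http://$1\">$1</a>");
        return s;
    }
}
```

With this in place, "Visit http://www.google.com today".FixUrls() should wrap the URL in a link, while the already-linked string from the question (with the extra spaces inside the anchor) should come back unchanged.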

I don't understand what was going wrong well enough to give any real context for why this change works, and I dislike RegExes enough that I can't quite justify the time figuring it out, but maybe it'll help someone else eventually :-).

Ken Smith answered Oct 20 '22 18:10