Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is a left parenthesis being escaped in this Regex?

Tags:

c#

.net

regex

I'm using an HTML sanitizing whitelist code found here:
http://refactormycode.com/codes/333-sanitize-html

I needed to add the "font" tag as an additional tag to match, so I tried adding this condition after the <img tag check

if (tagname.StartsWith("<font"))
{
    // detailed <font> tag checking
    // Non-escaped expression (for testing in a Regex editor app)
    // ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
    if (!IsMatch(tagname, @"<font
                            (\s*size=""\d{1}"")?
                            (\s*color=""((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)"")?
                            (\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
                             \s*?>"))
    {
        html = html.Remove(tag.Index, tag.Length);
    }
}

Aside from the condition above, my code is almost identical to the code in the page I linked to. When I try to test this in C#, it throws an exception saying "Not enough )'s". I've counted the parenthesis several times and I've run the expression through a few online Javascript-based regex testers and none of them seem to tell me of any problems.

Am I missing something in my Regex that is causing a parenthesis to escape? What do I need to do to fix this?

UPDATE
After a lot of trial and error, I remembered that the # sign is a comment in regexes. The key to fixing this is to escape the # character. In case anyone else comes across the same problem, I've included my fix (just escaping the # sign)

if (tagname.StartsWith("<font"))
{
    // detailed <font> tag checking
    // Non-escaped expression (for testing in a Regex editor app)
    // ^<font(\s*size="\d{1}")?(\s*color="((#[0-9a-f]{6})|(#[0-9a-f]{3})|red|green|blue|black|white)")?(\s*face="(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)")?\s*?>$
    if (!IsMatch(tagname, @"<font
                            (\s*size=""\d{1}"")?
                            (\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
                            (\s*face=""(Arial|Courier\sNew|Garamond|Georgia|Tahoma|Verdana)"")?
                             \s*?>"))
    {
        html = html.Remove(tag.Index, tag.Length);
    }
}
like image 715
Dan Herbert Avatar asked Jan 25 '23 00:01

Dan Herbert


1 Answers

Your IsMatch Method is using the option RegexOptions.IgnorePatternWhitespace, that allows you to put comments inside the regular expressions, so you have to scape the # chatacter, otherwise it will be interpreted as a comment.

if (!IsMatch(tagname,@"<font(\s*size=""\d{1}"")?
    (\s*color=""((\#[0-9a-f]{6})|(\#[0-9a-f]{3})|red|green|blue|black|white)"")?
    (\s*face=""(Arial|Courier New|Garamond|Georgia|Tahoma|Verdana)"")?
    \s?>"))
{
    html = html.Remove(tag.Index, tag.Length);
}
like image 135
Christian C. Salvadó Avatar answered Feb 04 '23 00:02

Christian C. Salvadó