
Can you construct a RegEx to replace unwanted characters with the underscore?

Tags: c#, regex, replace

I'm trying to write a string 'clean-up' function that allows only alphanumeric characters, plus a few others, such as the underscore, period and the minus (dash) character.

Currently our function uses straight char iteration of the source string, but I'm trying to convert it to RegEx because from what I've been reading, it is much cleaner and more performant (which seems backwards to me over a straight iteration, but I can't profile it until I get a working RegEx.)

The problem is two-fold for me. One, I know the following regex...

[a-zA-Z0-9]

...matches a range of alphanumeric characters, but how do I also include the underscore, period and the minus character? Do you simply escape them with the '\' character and put them between the brackets with the rest?

Second, for any character that isn't part of the match (i.e. other punctuation like '?') we would like it replaced with an underscore.

My thinking is that instead of matching on a range of desired characters, we match on any single character that's not in the desired range, then replace that. I think the RegEx for that puts the caret as the first character between the brackets, like this...

[^a-zA-Z0-9]

Is that the correct approach?

asked Jul 09 '13 15:07 by Mark A. Donohoe


1 Answer

Probably the most efficient way to do this is to set up a static Regex that describes the characters that you want to replace.

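using System.Text.RegularExpressions;

public static class StringCleaner
{
    // Matches any single character that is NOT a letter, digit, period, underscore or hyphen.
    // IgnoreCase lets A-Z cover a-z as well; Compiled trades startup cost for faster matching.
    public static readonly Regex invalidChars = new Regex(@"[^A-Z0-9._\-]", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static string ReplaceInvalidChars(string input)
    {
        // Replace every disallowed character with an underscore.
        return invalidChars.Replace(input, "_");
    }
}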
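As a quick sanity check of the pattern above, here is a minimal, self-contained sketch. The class name CleanerDemo, the method name Clean and the sample input are made up for illustration; only the pattern and options come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CleanerDemo
{
    // Same pattern and options as in the answer: anything that is not a
    // letter, digit, period, underscore or hyphen counts as invalid.
    private static readonly Regex InvalidChars =
        new Regex(@"[^A-Z0-9._\-]", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: replaces each invalid character with an underscore.
    public static string Clean(string input)
    {
        return InvalidChars.Replace(input, "_");
    }

    public static void Main()
    {
        // The space and the '?' are replaced; the period is kept.
        Console.WriteLine(Clean("report v2?.txt")); // report_v2_.txt
    }
}
```

Note that each invalid character becomes its own underscore, so runs of invalid characters are not collapsed into one.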

However, if you don't want the Regex to replace line ends and whitespace (like spaces and tabs) you'll need to use a slightly different expression.

public static readonly Regex invalidChars = new Regex(@"[^A-Z0-9._\-\s]", RegexOptions.Compiled | RegexOptions.IgnoreCase);
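To see the difference between the two expressions side by side, here is a small sketch. The class name WhitespaceDemo, the method names and the sample input are assumptions for illustration; the two patterns are the ones shown in the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Replaces everything outside the allowed set, including whitespace.
    private static readonly Regex Strict =
        new Regex(@"[^A-Z0-9._\-]", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Adding \s to the negated set means whitespace is left alone.
    private static readonly Regex KeepWhitespace =
        new Regex(@"[^A-Z0-9._\-\s]", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static string CleanStrict(string input)
    {
        return Strict.Replace(input, "_");
    }

    public static string CleanKeepWhitespace(string input)
    {
        return KeepWhitespace.Replace(input, "_");
    }

    public static void Main()
    {
        string input = "my file?\tv1.0";
        Console.WriteLine(CleanStrict(input));         // my_file__v1.0
        Console.WriteLine(CleanKeepWhitespace(input)); // only the '?' is replaced
    }
}
```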

Also, here are the rules for what you must escape to match the literal character:

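Inside a set denoted by square brackets you must escape ] and \ anywhere they occur, - unless it is the first or last character of the set (anywhere else it defines a range), and ^ only if it appears in the first position of the set. A period or # needs no escape inside a set, though escaping them is harmless. Outside of a set you must escape these characters: . $ ^ | { [ ( ) * + ? \ to match the literal character. # and whitespace only need escaping outside a set when RegexOptions.IgnorePatternWhitespace is on. If in doubt, Regex.Escape will escape a literal string for you.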
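A few of these rules can be checked directly. The sketch below is illustrative only: the class name EscapeDemo and the sample strings are made up, while Regex.Escape and Regex.IsMatch are the standard System.Text.RegularExpressions calls.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public static void Main()
    {
        // Regex.Escape backslash-escapes metacharacters so the text matches literally.
        Console.WriteLine(Regex.Escape("a.b*c"));            // a\.b\*c

        // Inside a set, a hyphen between two characters defines a range...
        Console.WriteLine(Regex.IsMatch("m", @"^[a-z]$"));   // True
        // ...while an escaped hyphen is just a literal hyphen.
        Console.WriteLine(Regex.IsMatch("m", @"^[a\-z]$"));  // False
        Console.WriteLine(Regex.IsMatch("-", @"^[a\-z]$"));  // True

        // A period is literal inside a set, but matches any character outside one.
        Console.WriteLine(Regex.IsMatch("x", @"^[.]$"));     // False
        Console.WriteLine(Regex.IsMatch("x", @"^.$"));       // True
    }
}
```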

See the following documentation for more information:

  • .NET Framework Regular Expressions
  • Regex Class
  • RegexOptions Enumeration
answered Sep 21 '22 02:09 by JamieSee