Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replace with Regex

Tags:

c#

regex

In our application, user is entering data from MS word to asp.net textarea control and finally data is saved in SQL Server. For some reason, there are few junk characters which looks like small squares when viewed from SQL Server Management Studio.

This is causing error while generating Crystal Reports.

I need a regex which will strip all such characters along with bullets. The only valid input is

A-Z, a-z , 0-9, ~ ! @ # % $ ^ & *  ( ) _ + | ` - = \ {}:">? < [ ] ; ' , . /

Also, tab spaces should be replaced with single space. Enter key or New line is allowed.

Currently I am using

Regex.Replace(data, @"[^\u0000-\u007F]", " ");

but it won't work to remove bullets or tab spaces.

Can any regex ninja help me with this problem? Thanks in advance.

like image 666
NoobDeveloper Avatar asked Feb 13 '13 12:02

NoobDeveloper


1 Answers

You can use two regexes. The first, with the pattern "\t|<bullet>" (where <bullet> stands for the representation of the bullet) is used first, to replace TABs and bullets with spaces (" "). The second, the pattern of which is a negated character set containing your list of valid characters, is used second, to replace invalid characters with the empty string (""), that is, to get rid of them. Since you need to keep CR and LF characters (and space), these must be added to the set of valid characters:

using System;
using System.Text.RegularExpressions;

static class Program
{
  public static void Main()
  {
    string pattern1 = @"\t";
    Regex regex1 = new Regex(pattern1, RegexOptions.Compiled);
    string pattern2 = @"[^A-Za-z0-9~!#$^&*()_+|`\-=\\{}:"">?<\[\];',./ \r\n]";
    Regex regex2 = new Regex(pattern2, RegexOptions.Compiled);

    string input = "ABZabz09~!#$^&*()_+|`-=\\{}:\">?<[];',./ \r\nárvíztűrő\ttükörfúrógép";
    string temp = regex1.Replace(input, " ");
    string output = regex2.Replace(temp, "");
    Console.WriteLine(input);
    Console.WriteLine(output);
    Console.ReadKey(true);
  }
}

Output:

ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
árvíztűrő       tükörfúrógép
ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
rvztr tkrfrgp

Note that the TAB after árvíztűrő was replaced with a single space.

About bullets:

I made a bulleted list in Word and copied it into a textarea in a webpage. Then I saved the HTML and figured out that the bullets are saved as the UTF-8-encoded character E280A2. This is what I called above the "representation of the bullet". You should figure out the binary representation of possible bullet characters and add them to the first pattern: either OR them to the TAB character, or put all of them into a character set:

using System;
using System.Text;
using System.Text.RegularExpressions;

static class Program
{
  public static void Main()
  {
    byte[] bulletBytes = new byte[] { 0xE2, 0x80, 0xA2 };
    string bullet= Encoding.UTF8.GetString(bulletBytes);

    string pattern1 = @"[\t" + bullet + "]";
    Regex regex1 = new Regex(pattern1, RegexOptions.Compiled);
    string pattern2 = @"[^A-Za-z0-9~!#$^&*()_+|`\-=\\{}:"">?<\[\];',./ \r\n]";
    Regex regex2 = new Regex(pattern2, RegexOptions.Compiled);

    string input = 
      bullet + "ABZabz09~!#$^&*()_+|`-=\\{}:\">?<[];',./ \r\n" + 
      bullet + "árvíztűrő\ttükörfúrógép";
    string temp = regex1.Replace(input, " ");
    string output = regex2.Replace(temp, "");
    Console.OutputEncoding = Encoding.UTF8;
    Console.WriteLine(input);
    Console.WriteLine(output);
    Console.ReadKey(true);
  }
}

Output (you should change the console font to Lucida Console to see the bullet):

•ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
•árvíztűrő      tükörfúrógép
 ABZabz09~!#$^&*()_+|`-=\{}:">?<[];',./
 rvztr tkrfrgp

Now in addition to TAB, the bullet at the beginning of each line has also been replaced with a space.

like image 185
kol Avatar answered Nov 08 '22 23:11

kol