Better way to remove characters that aren't ASCII 32 to 175 C#

I need to remove every character from a string that is not in the ASCII range 32 to 175; anything outside that range has to go.

I'm not sure whether a regular expression is the best solution, or whether I should use something like .Replace() or .Remove(), passing in each invalid character, or something else entirely.

Any help will be appreciated.

FabianSilva asked Jul 18 '12 14:07

1 Answer

You can use

Regex.Replace(myString, @"[^\x20-\xaf]+", "");

The regex here is a negated character class: [...] defines the class, the ^ at the start negates it, and \x20-\xaf is the range U+0020 to U+00AF (32–175, expressed in hexadecimal notation), so it matches every character outside that range. The trailing + matches a whole run of such characters at once. As far as regular expressions go this one is fairly basic, but it may puzzle someone not very familiar with them.

But you can go another route as well:

new string(myString.Where(c => (c >= 32) && (c <= 175)).ToArray());

Which one to pick probably depends mostly on what you're more comfortable reading. For someone without much regex experience, I'd say the second one is clearer.
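To see the two variants side by side, here is a minimal sketch that wraps each one in a method. The class and method names (AsciiRangeDemo, ViaRegex, ViaLinq) are made up for illustration; only the two expressions inside come from the snippets above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the names are illustrative only.
public static class AsciiRangeDemo
{
    // Removes every run of characters outside U+0020..U+00AF.
    public static string ViaRegex(string s)
    {
        return Regex.Replace(s, @"[^\x20-\xaf]+", "");
    }

    // Keeps only characters whose code is between 32 and 175 inclusive.
    public static string ViaLinq(string s)
    {
        return new string(s.Where(c => c >= 32 && c <= 175).ToArray());
    }
}
```

For an input such as "caf\u00e9\t5\u20ac ok", both methods should drop the é (233), the tab (9) and the € (8364) and return "caf5 ok".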

A few performance measurements, 10000 rounds each, in seconds:

2000 characters, the first 143 of which are between 32 and 175
  Regex without +                          4.1171
  Regex with +                             0.4091
  LINQ, where, new string                  0.2176
  LINQ, where, string.Join                 0.2448
  StringBuilder (xanatos)                  0.0355
  LINQ, horrible (HatSoft)                 0.4917
2000 characters, all of which are between 32 and 175
  Regex without +                          0.4076
  Regex with +                             0.4099
  LINQ, where, new string                  0.3419
  LINQ, where, string.Join                 0.7412
  StringBuilder (xanatos)                  0.0740
  LINQ, horrible (HatSoft)                 0.4801

So yes, my approaches are the slowest :-). You should probably go with xanatos' answer and wrap that in a method with a nice, clear name. For inline usage or quick-and-dirty things or where performance does not matter, I'd probably use the regex.
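xanatos' answer isn't reproduced here, so the following is only a rough sketch of what a StringBuilder-based method with a clear name might look like, not his actual code. The names StringFilter and KeepCodes32To175 are assumptions.

```csharp
using System;
using System.Text;

// Hypothetical helper; the class and method names are illustrative only.
public static class StringFilter
{
    // Returns a copy of input containing only characters with codes 32..175.
    public static string KeepCodes32To175(string input)
    {
        if (input == null) throw new ArgumentNullException("input");

        // Pre-size the builder: the result can never be longer than the input.
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c >= 32 && c <= 175)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Calling StringFilter.KeepCodes32To175(myString) then reads clearly at the call site, and the single pass over the string avoids both the regex engine and the LINQ iterator overhead.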

Joey answered Oct 14 '22 15:10
