
Matching extended ASCII characters in .NET Regex

I'm writing a .NET regular expression that needs to match all ASCII and extended ASCII characters except for control characters.

To do this, I consulted the ASCII table, and it seems that all of these characters have character codes from x20 to xFF.

So I suppose

[\x20-\xFF]

should be able to match all the characters that I need. However, in reality, some characters can be matched, while others cannot. For example, if you test with the online tool http://regexhero.net/tester/, or write a simple C# program, you will find that some characters such as "ç" (xE7) can be matched, but some characters such as "œ" (x9C) cannot.

Does anyone have any idea why the regex does not work?

user3572645 asked Mar 05 '15 14:03


1 Answer

I've tried to reproduce your error and found nothing wrong with your code:

using System;
using System.Text.RegularExpressions;

String pattern = @"[\x20-\xFF]";

// Every character from 0x20 to 0xFF must match the pattern
for (Char ch = ' '; ch <= 255; ++ch)
  if (!Regex.IsMatch(ch.ToString(), pattern)) 
    Console.Write("Failed!");

// No character from 0x100 upwards may match the pattern
for (Char ch = (Char)256; ch < Char.MaxValue; ++ch)
  if (Regex.IsMatch(ch.ToString(), pattern)) 
    Console.Write("Failed!");

Then I've examined your samples:

 ((int)'ç').ToString("X2"); // <- returns E7, OK
 ((int)'œ').ToString("X2"); // <- returns 153 NOT x9C 

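Note that 'œ' (x153) is actually outside [0x20..0xFF], and that's why the match returns false. So I guess you've got a typo: x9C is the byte for 'œ' in the Windows-1252 code page, whereas .NET strings are UTF-16 and the regex compares against the Unicode value x153.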
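To make the difference concrete, here is a minimal sketch that compares the pattern from the question with a hypothetical extended one. The class name `ExtendedAsciiDemo`, the helper `Matches`, and the `Extended` pattern are illustrative assumptions, not something from the question. `Extended` simply adds 'Œ' and 'œ' by their Unicode values and leaves out x7F to x9F, which Unicode treats as control characters. Any other Windows-1252 character you need would have to be added the same way.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtendedAsciiDemo
{
    // Pattern from the question: UTF-16 values 0x20 through 0xFF.
    public const string Original = @"[\x20-\xFF]";

    // Hypothetical variant (an assumption for illustration): printable ASCII,
    // the printable Latin-1 range, plus U+0152 and U+0153 ('Œ' and 'œ').
    public const string Extended = @"[\x20-\x7E\xA0-\xFF\u0152\u0153]";

    // Tests a single character against a character-class pattern.
    public static bool Matches(string pattern, char ch)
    {
        return Regex.IsMatch(ch.ToString(), pattern);
    }

    static void Main()
    {
        // '\u00E7' is 'ç' and '\u0153' is 'œ'; escapes avoid source-encoding issues.
        Console.WriteLine(((int)'\u0153').ToString("X4")); // prints 0153
        Console.WriteLine(Matches(Original, '\u00E7'));    // prints True
        Console.WriteLine(Matches(Original, '\u0153'));    // prints False
        Console.WriteLine(Matches(Extended, '\u0153'));    // prints True
        Console.WriteLine(Matches(Extended, '\u009C'));    // prints False (control character)
    }
}
```

Listing the extra characters by their Unicode values keeps the pattern independent of any code page, at the cost of having to enumerate them by hand.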

Dmitry Bychenko answered Oct 16 '22 09:10