
Regex \w matches ê

Tags:

regex

Generally I always thought that in regular expressions \w is short for [A-Za-z0-9_], as per Wikipedia.

But recently I had an issue, in C#.NET, that it matches something else. I was parsing some French, and discovered that \w matches ê (e-circumflex).

Strange, I thought; I didn't expect that. So I tested the same regex in a couple of other languages and noticed some inconsistencies.

Given the following code samples:

C#.NET (Specifically .NET 4.7.2 if that matters), .NET Fiddle here

var r = new Regex(@"\w");
Console.WriteLine(r.IsMatch("ê"));

output :

True

Javascript (Chrome), JSBin here

var r = /\w/;
console.log(r.test("ê"));

//or 
var s = new RegExp('\\w'); // the backslash must be escaped inside a string literal
console.log(s.test("ê"));

output:

false
false

PHP (v7.4.7), onlinephpfunctions here

$str = "ê";
$pattern = "/\w/";
echo preg_match($pattern, $str);

outputs

0

Perl (v5.24.2), link here

$str = "ê";
if ($str =~ m/\w/i) {
  print "Match found\n";
} else {
  print "No match found\n";
}

outputs

No match found

Python, repl.it here

import re
p = re.compile(r'\w')  # raw string so the backslash reaches the regex engine
m = p.match("ê")
if m:
    print('Match found')

outputs

Match found

Is it just me, or does something not seem right? Does anyone know what's going on here? Why are .NET and Python different from PHP, JS and, the daddy of them all, Perl?

OJay asked May 07 '26 13:05


1 Answer

In .NET, as well as XML Schema, Python 3 (not Python 2), and ICU (Android, R stringr / stringi functions), \w is Unicode-aware by default.

It is not Unicode-aware by default in PCRE and Java, but you can turn it on with the right flag: /u in PHP's PCRE functions, and (?U) / Pattern.UNICODE_CHARACTER_CLASS in Java.
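
For comparison, Python 3 has the opposite switch: \w is Unicode-aware by default for str patterns, and the re.ASCII flag (or the inline (?a) form) restricts it to [A-Za-z0-9_]. A minimal sketch using only the standard re module; the sample characters are just illustrations:

```python
import re

# Python 3 str patterns: \w is Unicode-aware unless re.ASCII is given.
unicode_word = re.compile(r"\w")
ascii_word = re.compile(r"\w", re.ASCII)
inline_ascii_word = re.compile(r"(?a)\w")  # same restriction via an inline flag

for ch in ["ê", "e", "_", "-"]:
    # Columns: character, matches Unicode-aware \w, matches ASCII-only \w
    print(repr(ch), bool(unicode_word.fullmatch(ch)), bool(ascii_word.fullmatch(ch)))
```

Only the "ê" row should differ between the two columns, which is the same split seen between .NET / Python and PHP / JS in the question.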

See the Shorthand Character Classes reference:

\w stands for “word character”. It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.

The Unicode-aware \w meanings:

  • c# - [\p{L}\p{Nd}\p{Mn}\p{Pc}] (source)
  • python - [\p{L}\p{Mn}\p{Nd}_] (source) (Note: this is an approximate pattern that can only be used with the PyPI regex module, since re does not support Unicode property classes, so it's really great that \w is Unicode-aware in Python 3)
  • android - [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}] (source)
  • icu - [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\u200c\u200d] (source)
  • xsd - [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (source)
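
To make the C# entry in the list above concrete, here is a rough sketch that tests a character against the general categories it names (L, Mn, Nd, Pc) with Python's standard unicodedata module. The helper name is_dotnet_style_word_char is hypothetical, and the answer depends on the Unicode version bundled with your Python build, so treat it as an approximation and not a reimplementation of any engine:

```python
import unicodedata


def is_dotnet_style_word_char(ch: str) -> bool:
    """Approximate [\\p{L}\\p{Mn}\\p{Nd}\\p{Pc}] for a single character.

    Hypothetical helper for illustration; not part of any regex library.
    """
    category = unicodedata.category(ch)  # two-letter code, e.g. 'Ll', 'Nd'
    # Any letter category (Lu, Ll, Lt, Lm, Lo) or one of Mn, Nd, Pc.
    return category.startswith("L") or category in {"Mn", "Nd", "Pc"}


for ch in ["ê", "_", "-", "\u0301", "\u0663"]:
    print(repr(ch), unicodedata.category(ch), is_dotnet_style_word_char(ch))
```

The same approach can be adapted to the other entries by swapping the category set, although properties such as Alphabetic or Join_Control are not exposed by unicodedata and would need another data source.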

When \w is made Unicode-aware:

  • pcre - (With /u in PHP or (*UCP) / (*UTF)(*UCP)) - [\p{L}\p{N}_] ("\w any character that matches \p{L} or \p{N}, plus underscore")
  • java - (With (?U) or Pattern.UNICODE_CHARACTER_CLASS) - [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}\p{IsJoin_Control}] (same as Android, source)
  • perl - (make Perl treat the source file and strings as Unicode, see Does \w match all alphanumeric characters defined in the Unicode standard?) - [\p{Alphabetic}\p{GC=Mark}\p{GC=Connector_Punctuation}\p{GC=Decimal_Number}]

In JavaScript, there is no way to make \w Unicode-aware, so use an explicit class together with the u flag (which is required for \p{...}): [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}].

Wiktor Stribiżew answered May 09 '26 02:05


