
\w in Ruby Regular Expression matches Chinese characters

Tags: regex, ruby

I use the code below:

puts "matched" if "中国" =~ /\w+/

It prints "matched", which surprised me: "中国" is two Chinese characters and contains none of 0-9, a-z, A-Z or _. So why does it output "matched"?

Could somebody give me some clues?

ywenbo asked Dec 31 '10 13:12



1 Answer

I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration: .NET works this way as well. MSDN says this about it:

\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

So \w doesn't necessarily mean just [a-zA-Z_0-9]. It (like other shorthand classes) behaves differently on Unicode strings than on ASCII ones.
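
To make the distinction concrete in Ruby itself, here is a minimal sketch. It assumes Ruby 1.9 or later, where (as far as I know) a bare \w is ASCII-only and \p{Word} is the Unicode-aware class; older interpreters and other engines such as .NET may treat \w itself as Unicode-aware, which would explain the match in the question. The constant and method names are made up for illustration.

```ruby
# Hypothetical names, for illustration only.
ASCII_WORD   = /\A[a-zA-Z0-9_]+\z/  # the "classic" ASCII meaning of \w
UNICODE_WORD = /\A\p{Word}+\z/      # Unicode letters, marks, numbers, connector punctuation

# True when every character is one of a-z, A-Z, 0-9 or _.
def ascii_word?(text)
  !(text =~ ASCII_WORD).nil?
end

# True when every character is a Unicode word character.
def unicode_word?(text)
  !(text =~ UNICODE_WORD).nil?
end

p ascii_word?("hello_1")   # => true
p ascii_word?("中国")       # => false
p unicode_word?("中国")     # => true
p unicode_word?("中国!")    # => false ("!" is punctuation, not a word character)
p unicode_word?("中 国")    # => false (a space is not a word character)
```

If the ASCII-only behaviour is what you want, spelling out [a-zA-Z0-9_] explicitly avoids depending on how a given engine or version defines \w.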

This still makes it different from . though, as \w wouldn't match punctuation characters (sort of: see \p{Pc} in the list below), spaces, new lines and various other non-word symbols.

As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc} does match, you can see on a Unicode reference list:

  • \p{Ll} Lowercase Unicode letter
  • \p{Lu} Uppercase Unicode letter
  • \p{Lt} Titlecase Unicode letter
  • \p{Lo} Other Unicode letter
  • \p{Nd} Decimal digit number
  • \p{Pc} Connector punctuation (such as the underscore)
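
The categories above can be checked one at a time. The following is a rough sketch in Ruby, assuming an interpreter whose regex engine supports the short \p{..} general-category names (Ruby 1.9 and later should). WORD_CATEGORIES and word_category are hypothetical helpers, not part of any library.

```ruby
# Hypothetical lookup table: the six categories from the MSDN definition.
WORD_CATEGORIES = {
  "Ll" => /\p{Ll}/,  # lowercase letter
  "Lu" => /\p{Lu}/,  # uppercase letter
  "Lt" => /\p{Lt}/,  # titlecase letter
  "Lo" => /\p{Lo}/,  # other letter (includes Chinese characters)
  "Nd" => /\p{Nd}/,  # decimal digit
  "Pc" => /\p{Pc}/   # connector punctuation
}

# Returns the name of the first category that matches the character,
# or nil if the character is in none of the six categories.
def word_category(char)
  pair = WORD_CATEGORIES.find { |_name, pattern| char =~ pattern }
  pair && pair.first
end

["a", "A", "\u01C5", "中", "5", "_", "!", " "].each do |char|
  puts "#{char.inspect} -> #{word_category(char).inspect}"
end
```

Here "中" falls under Lo, which is why a Unicode-aware \w accepts it, while "!" and " " belong to none of the six categories.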
Michael Low answered Oct 17 '22 09:10
