I use the code below:
puts "matched" if "中国" =~ /\w+/
it puts "matched"
and surprised me, since "中国" is two Chinese characters, it doesn't any of 0-9, a-z, A-Z and _, but why it outputs "matched".
Could somebody give me some clues?
I'm not sure of the exact flavor of regex that Ruby uses, but this isn't just a Ruby aberration as .net works this way as well. MSDN says this about it:
\w
Matches any word character. For non-Unicode and ECMAScript implementations, this is the same as [a-zA-Z_0-9]. In Unicode categories, this is the same as [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
So it's not the case that \w
necessarily just means [a-zA-Z_0-9]
- it (and other operators) operate differently on Unicode strings compared to how they do for Ascii ones.
This still makes it different from .
though, as \w
wouldn't match punctuation characters (sort of - see the \p{Lo} list below though) , spaces, new lines and various other non-word symbols.
As for what exactly \p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}
does match, you can see on a Unicode reference list:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With