# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding               # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)   # finds all Cyrillic characters
str2.gsub!(/\w/u, '')         # removes only the Latin characters
puts str2
The question is: why does \w ignore Cyrillic characters?
I have installed the latest Ruby package from http://rubyinstaller.org/.
Here is the output of ruby -v:
ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]
As far as I know, the Oniguruma regular expression library in 1.9 has full support for Unicode characters.
This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't match any non-ASCII character, Cyrillic letters included. The /u flag does not change this.
You probably want to use [[:alnum:]] instead, which matches all Unicode alphabetic and numeric characters. Check also [[:word:]] and [[:alpha:]].
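To see the difference side by side, here is a minimal sketch using the string from the question. The helper name strip_matching is made up for illustration and is not part of Ruby; it assumes Ruby 1.9 or later with a UTF-8 source file.

```ruby
# coding: utf-8

# Hypothetical helper (not a Ruby built-in): returns a copy of `text`
# with every character matching `pattern` removed.
def strip_matching(text, pattern)
  text.gsub(pattern, '')
end

str = "asdfМикимаус"

# \w is ASCII-only, so the Cyrillic letters are left in place.
puts strip_matching(str, /\w/)                  # Микимаус

# POSIX bracket classes are Unicode-aware, so nothing is left.
puts strip_matching(str, /[[:alnum:]]/).empty?  # true
puts strip_matching(str, /[[:word:]]/).empty?   # true

# \p{Cyrillic} targets one script, leaving only the Latin letters.
puts strip_matching(str, /\p{Cyrillic}/)        # asdf
```

Note that [[:word:]] also matches the underscore and combining marks, so it is the closest Unicode-aware counterpart to \w, while [[:alpha:]] leaves digits alone.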