
How to specify a Regexp for Unicode Cyrillic characters in Ruby 1.9

# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding              # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)  # finds all Cyrillic characters
str2.gsub!(/\w/u, '')        # removes only the Latin characters
puts str2

The question is: why does \w ignore Cyrillic characters?

I have installed the latest Ruby package from http://rubyinstaller.org/. Here is the output of ruby -v:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

As far as I know, the Oniguruma regular expression library in 1.9 has full support for Unicode characters.

user326922 asked Apr 27 '10 14:04



1 Answer

This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't match any non-ASCII character, Cyrillic included.

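You probably want to use [[:alnum:]] instead, which includes all Unicode alphabetic and numeric characters. See also [[:word:]], which additionally matches the underscore and so is the closest Unicode-aware counterpart to \w, and [[:alpha:]], which matches letters only.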
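To make the difference concrete, here is a minimal sketch comparing \w with the classes mentioned above. The sample string is an assumption (the string from the question with an underscore and two digits added for illustration), and it assumes a Ruby version in which the POSIX bracket expressions are Unicode-aware, as described in this answer.

```ruby
# coding: utf-8
# Sketch only: the sample string is hypothetical, extended from the question.
sample = "asdf_Микимаус42"

# Collect the characters each pattern matches, joined back into a string.
ascii_word  = sample.scan(/\w/).join            # ASCII letters, digits, underscore
posix_word  = sample.scan(/[[:word:]]/).join    # Unicode letters, digits, underscore
posix_alnum = sample.scan(/[[:alnum:]]/).join   # Unicode letters and digits
posix_alpha = sample.scan(/[[:alpha:]]/).join   # Unicode letters only
cyrillic    = sample.scan(/\p{Cyrillic}/).join  # Cyrillic script only

puts ascii_word   # asdf_42
puts posix_word   # asdf_Микимаус42
puts posix_alnum  # asdfМикимаус42
puts posix_alpha  # asdfМикимаус
puts cyrillic     # Микимаус
```

So for the original goal of stripping every word character regardless of script, sample.gsub(/[[:word:]]/, '') would be the drop-in replacement for the \w version.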

Marc-André Lafortune answered Oct 03 '22 19:10
