How to specify Regexp for unicode cyrillic characters in Ruby 1.9

Question

# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding              # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)  # finds all Cyrillic characters
str2.gsub!(/\w/u, '')        # removes only the Latin characters
puts str2

The question is: why does \w ignore Cyrillic characters?

I installed the latest Ruby package from http://rubyinstaller.org/. Here is the output of ruby -v:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

As far as I know, the Oniguruma regular expression library in 1.9 has full support for Unicode characters.

Marc-André Lafortune · Accepted Answer

This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and thus doesn't match any non-ASCII character, Cyrillic included.

You probably want to use [[:alnum:]] instead, which includes all Unicode alphabetic and numeric characters. Check also [[:word:]] and [[:alpha:]].
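To make the difference concrete, here is a minimal sketch that reuses the string from the question. The helper method names are made up for illustration and are not part of any library. It assumes a UTF-8 source file and Ruby 1.9 or later.

```ruby
# coding: utf-8

# Hypothetical helper: \w is [a-zA-Z0-9_], so only ASCII word characters are removed.
def strip_ascii_word(str)
  str.gsub(/\w/, '')
end

# Hypothetical helper: the POSIX bracket class [[:alnum:]] is Unicode-aware,
# so letters and digits from any script are removed.
def strip_unicode_alnum(str)
  str.gsub(/[[:alnum:]]/, '')
end

# Hypothetical helper: collect only characters that belong to the Cyrillic script.
def cyrillic_chars(str)
  str.scan(/\p{Cyrillic}/)
end

sample = "asdfМикимаус"
puts strip_ascii_word(sample)             # Cyrillic letters are left in place
puts strip_unicode_alnum(sample).inspect  # every letter is removed
puts cyrillic_chars(sample).join          # only the Cyrillic part
```

If the goal is to target one script rather than "any letter", \p{Cyrillic} is the narrower choice. [[:alnum:]] and [[:alpha:]] also match letters from other alphabets.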

Tags: regex, ruby, encoding, unicode, character-properties

user326922
