
Weird behaviour of Ruby regex in Rails with UTF-8 characters

I have a problem with one of my validation regexes when it is applied to non-ASCII UTF-8 characters. I ran a few experiments, and it appears that Ruby regexes behave differently in the Rails environment than in plain Ruby.

Here is my experiment with a Chinese string.

In plain Ruby:

string = "運動會"
puts string[/\A[\w]*\z/]
=> match "運動會" - ok

In Rails:

# coding: utf-8
task :test => :environment do
  string = "運動會"
  puts string[/\A[\w]*\z/]
end
$ rake test
=> nothing - not ok

If I omit # coding: utf-8, Ruby fails with invalid multibyte char (US-ASCII). Even with the comment in place, the regex doesn't match.

Of course, I have checked everything (Ruby version, script files encoded in UTF-8, ...).

I use:

  • Rails 3.0.7
  • Ruby 1.9.2 (ruby-1.9.2-p180)

So my conclusion is that Rails alters the way regexes behave, and I have not found a way to make them behave as they do in plain Ruby.

Hartator asked May 23 '11 09:05


1 Answer

OK, I found an answer to my problem. In Ruby 1.9, \w matches only ASCII characters, whereas in Ruby 1.8 it also matched Unicode characters. In Ruby 1.9 we now have to use: [\w\P{ASCII}]

More info: http://www.ruby-forum.com/topic/210770
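
To make the difference concrete, here is a minimal sketch that compares the plain \w pattern with the [\w\P{ASCII}] pattern above. The method names are made up for illustration, and the \p{Word} variant is an alternative that is not part of the original answer; it assumes Ruby 1.9 or later and a source file saved as UTF-8.

```ruby
# Illustrative sketch; the method names are hypothetical helpers.

# Plain \w: in Ruby 1.9+ this is ASCII-only ([a-zA-Z0-9_]).
def ascii_word?(string)
  !(string =~ /\A\w*\z/).nil?
end

# Pattern from the answer: ASCII word characters plus any non-ASCII character.
def answer_word?(string)
  !(string =~ /\A[\w\P{ASCII}]*\z/).nil?
end

# Alternative (an assumption, not from the answer): the Unicode "Word" property,
# which covers letters, marks, numbers and connector punctuation.
def unicode_word?(string)
  !(string =~ /\A\p{Word}*\z/).nil?
end

p ascii_word?("運動會")    # false
p answer_word?("運動會")   # true
p unicode_word?("運動會")  # true
```

One difference to keep in mind: [\w\P{ASCII}] accepts every non-ASCII character, including punctuation such as "。", so answer_word?("運動會。") is true while unicode_word?("運動會。") is false. Which of the two suits a validation depends on whether non-ASCII punctuation should be allowed.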

Hartator answered Nov 09 '22 22:11
