
String length difference between Ruby 1.8 and 1.9

I have a website that's running on Ruby 1.8.7. I have a validation on an incoming post that checks that it contains at most 12000 characters. Spaces are counted as characters, and tabs and carriage returns are stripped before the post is subjected to the validation.

Here is the post that is subjected to validation: http://pastie.org/5047582

In Ruby 1.9 the string length shows up as 11909, which is correct. But when I check the length on Ruby 1.8.7 it turns out to be 12044.

I ran the code on codepad.org, which gives http://codepad.org/OxgSuKGZ (it outputs the length as 12044, which is wrong), but when I run the same code in the console at codeacademy.org the string length is 11909.

Can anybody explain why this is happening?

Thanks

Raghu asked Oct 12 '12 21:10


1 Answer

This is a Unicode issue. The string you are using contains characters outside the ASCII range, and UTF-8, the encoding most commonly used, encodes each of those as 2 or more bytes.

Ruby 1.8 is not encoding-aware: length simply returns the number of bytes in the string, which results in fun stuff like:

"ą".length
=> 2

Ruby 1.9 has better Unicode handling. This includes length returning the actual number of characters in the string, as long as Ruby knows the encoding:

"ä".length
=> 1
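
To make the byte/character distinction concrete, here is a small sketch for Ruby 1.9 or later. The helper name char_and_byte_counts is made up for illustration; length and bytesize are standard String methods. If the explanation above is right, the gap of 135 between 12044 and 11909 in the question would be the extra bytes taken up by the multi-byte characters in the post.

```ruby
# encoding: utf-8

# Hypothetical helper (not from the question): returns the number of
# characters and the number of bytes in a string.
# On Ruby 1.9+, String#length counts characters according to the
# string's encoding, while String#bytesize counts raw bytes.
def char_and_byte_counts(text)
  [text.length, text.bytesize]
end

chars, bytes = char_and_byte_counts("ą")
puts "#{chars} character(s), #{bytes} byte(s)"
```

On 1.9+ the byte count is roughly what Ruby 1.8 reports as length, so comparing the two numbers for the real post should show whether multi-byte characters explain the difference.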

One possible workaround in Ruby 1.8 is using regular expressions, which can be made Unicode aware:

"ą".scan(/./mu).size
=> 1
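
Applied to the validation from the question, the same trick could look roughly like the sketch below. The names MAX_POST_CHARS, utf8_char_count and within_post_limit? are hypothetical, and the code assumes the incoming post is valid UTF-8; only the 12000 limit comes from the question.

```ruby
# encoding: utf-8

MAX_POST_CHARS = 12000 # limit taken from the question

# Hypothetical helper: counts UTF-8 characters rather than bytes.
# The u flag makes the regex treat the string as UTF-8, and the m flag
# lets "." match newlines as well, so every character is counted.
def utf8_char_count(text)
  text.scan(/./mu).size
end

# Hypothetical validation check built on the helper above.
def within_post_limit?(text, limit = MAX_POST_CHARS)
  utf8_char_count(text) <= limit
end

puts within_post_limit?("ą" * MAX_POST_CHARS)
```

Because it does not rely on length being encoding-aware, this should give the same count on 1.8 and 1.9. It builds an array with one element per character, which is fine for a 12000-character post but would be wasteful for very large strings.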
Jakub Wasilewski answered Sep 17 '22 23:09