
Unicode characters in a Ruby script?

Tags: ruby, unicode

I would like to write a Ruby script which writes Japanese characters to the console. For example:

puts "こんにちは・今日は"

However, I get an error when running it:

jap.rb:1: Invalid char `\377' in expression
jap.rb:1: Invalid char `\376' in expression

Is it possible to do? I'm using Ruby 1.8.6.

Shay Friedman asked Aug 14 '10 15:08



1 Answer

You've saved the file in the UTF-16LE encoding, the one Windows misleadingly calls “Unicode”. This encoding is generally best avoided because it's not an ASCII-superset: each code unit is stored as two bytes, with ASCII characters having the other byte stored as \0. This will confuse an awful lot of software; it is unusual to use UTF-16 for file storage.
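To see the "other byte stored as \0" layout for yourself, here is a minimal sketch. It assumes Ruby 1.9 or later, since String#encode does not exist in 1.8; the variable names are just for illustration.

```ruby
# Encode a short ASCII snippet as UTF-16LE and look at the raw bytes.
ascii_source = "puts"
utf16_bytes = ascii_source.encode("UTF-16LE").bytes

# Every ASCII character is followed by a zero byte.
p utf16_bytes  # => [112, 0, 117, 0, 116, 0, 115, 0]
```

A byte-based parser sees those zero bytes as part of the source, which is why so much software chokes on UTF-16 files.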

What you are seeing with \377 and \376 (octal for \xFF and \xFE) is the U+FEFF Byte Order Mark sequence put at the front of UTF-16 files to distinguish UTF-16LE from UTF-16BE.
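If you want to check a file for this yourself, something along these lines should do. It is only a sketch: detect_bom is a made-up helper name, it looks only at the two UTF-16 marks and ignores everything else (UTF-32, for instance), and the sample string stands in for the first bytes of a real file.

```ruby
# Hypothetical helper: report which UTF-16 byte order mark, if any,
# starts the given byte string.
def detect_bom(data)
  head = data.unpack("C*").first(2)  # first two bytes as integers
  return :utf16le if head == [0xFF, 0xFE]
  return :utf16be if head == [0xFE, 0xFF]
  nil
end

# The two bytes Ruby complained about, shown in octal.
printf("\\%o \\%o\n", 0xFF, 0xFE)  # prints \377 \376

# Stand-in for the start of a file saved as "Unicode" on Windows.
sample = "\xFF\xFEp\x00u\x00"
p detect_bom(sample)  # => :utf16le
```

For a real script you would pass in the first bytes read from the file in binary mode instead of the sample string.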

Ruby 1.8 is totally byte-based; it makes no attempt to read Unicode characters from a script. So you can only save source files in ASCII-compatible encodings. Normally, you'd want to save your files as UTF-8 (without BOM; the UTF-8 faux-BOM is another great Microsoft innovation that breaks everything). This'd work great for scripts on the web producing UTF-8 pages.
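A rough illustration of what "byte-based" means, written for Ruby 1.9 or later (the \u escapes and String#bytesize are not available in 1.8.6). In 1.8, String#length reports bytes, so it would give the second number for both.

```ruby
# "こんにちは" written with escapes so the file itself stays pure ASCII.
greeting = "\u3053\u3093\u306b\u3061\u306f"

puts greeting.length    # characters (Ruby 1.9+): 5
puts greeting.bytesize  # bytes in UTF-8: 15
```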

And if you wanted to be sure the source code would be tolerant of being saved in any ASCII-compatible encoding, you could encode the string to make it more resilient (if less readable):

puts "\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe3\x83\xbb\xe4\xbb\x8a\xe6\x97\xa5\xe3\x81\xaf"
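You don't have to work out an escaped string like that by hand. Here is a small sketch of how it could be generated; hex_escape is a hypothetical helper name, and the sample text is built from code points with Array#pack so that the script itself contains nothing but ASCII.

```ruby
# Hypothetical helper: turn every byte of a string into a \xNN escape.
def hex_escape(str)
  str.unpack("C*").map { |byte| "\\x%02x" % byte }.join
end

# First three characters of the greeting, packed as UTF-8 from code points.
utf8_text = [0x3053, 0x3093, 0x306b].pack("U*")

puts hex_escape(utf8_text)  # prints \xe3\x81\x93\xe3\x82\x93\xe3\x81\xab
```

Both unpack("C*") and pack("U*") work on raw bytes, so this should behave the same way on 1.8 and on later versions.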

However! Writing to the console is itself a big problem. What encoding is used to send characters to the console varies from platform to platform. On Linux or OS X, it's usually UTF-8. On Windows, it's a different encoding for every installation locale (as selected on “Language for non-Unicode applications” in the “Regional and Language Options” control panel entry), and by default it's never UTF-8. This setting is (again, misleadingly) known as the ANSI code page. (Strictly speaking, the console window defaults to the matching OEM code page, which is the same as the ANSI one on Japanese installs but differs on Western ones.)
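To find out what a given machine is using, you can ask Ruby what it picked up from the environment. This is a sketch for Ruby 1.9 or later only (the Encoding class does not exist in 1.8), and what it prints depends entirely on the machine, so no sample output is shown.

```ruby
# Encoding Ruby derived from the platform locale settings.
locale_encoding = Encoding.find("locale")

puts "locale encoding:  #{locale_encoding}"
puts "default external: #{Encoding.default_external}"
# nil here means no explicit external encoding is set on the stream.
puts "STDOUT external:  #{STDOUT.external_encoding.inspect}"
```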

So if you are using a Japanese Windows install, your console encoding will be Windows code page 932 (a variant of Shift-JIS). If that's the case, you can save the text file from a text editor using “ANSI” or explicitly “Japanese cp932”, and when you run it in Ruby you'll get the right characters out. Again, if you wanted to make the source withstand misencoding, you could escape the string in cp932 encoding:

puts "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd\x81E\x8d\xa1\x93\xfa\x82\xcd"
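Those bytes can be produced by transcoding rather than by lookup tables. A minimal sketch, assuming Ruby 1.9 or later, where code page 932 goes by the encoding name Windows-31J:

```ruby
# The greeting, written with \u escapes to keep this file pure ASCII.
japanese = "\u3053\u3093\u306b\u3061\u306f\u30fb\u4eca\u65e5\u306f"

# Transcode from UTF-8 to Windows code page 932.
cp932 = japanese.encode("Windows-31J")

# Show the resulting bytes as hex pairs.
puts cp932.bytes.map { |byte| "%02x" % byte }.join(" ")
```

The hex pairs match the escapes in the line above, except that the "E" appears as 45, its byte value.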

But if you run it on a machine in another locale, it'll produce different characters. You will be unable to write Japanese to the default console from Ruby on a Western Windows installation (code page 1252).
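You can check up front whether a code page can hold your text at all. The following is a sketch for Ruby 1.9 or later; representable? is a hypothetical helper name, and Windows-1252 is used for the Western case simply because it is the code page named above.

```ruby
# Hypothetical helper: can every character of text be encoded in the
# named encoding?
def representable?(text, encoding_name)
  text.encode(encoding_name)
  true
rescue Encoding::UndefinedConversionError
  false
end

# The greeting, written with \u escapes to keep this file pure ASCII.
japanese = "\u3053\u3093\u306b\u3061\u306f\u30fb\u4eca\u65e5\u306f"

puts representable?(japanese, "Windows-31J")   # true
puts representable?(japanese, "Windows-1252")  # false
```

No amount of escaping gets around the second result: the Western code page simply has no slots for kana or kanji.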

(Whilst Ruby 1.9 improves Unicode handling a lot, it doesn't change anything here. It's still a bytes-based application using the C standard library IO functions, and that means it is limited to Windows's local code page.)

bobince answered Sep 23 '22 03:09
