
Unicode Strings in Ruby 1.9

I've written a Ruby script that reads a file (File.read()) containing Unicode characters, and it works fine from the command line.

However, when I try to put it into an Automator Workflow (Mac OS X), I get this error:

2009-12-23 17:55:15 -0500: /Users/jeffreyaylesworth/bin/symbols:19:in `split': invalid byte sequence in US-ASCII (ArgumentError)
(traceback)

So when running from Automator, split suddenly doesn't like non-ASCII characters. As far as I can tell, both are running the same version of Ruby (the version number is the same).

I'm not too concerned about why they behave differently (though if someone knows, that's great), but I would like a solution that makes split accept non-ASCII characters.

If it helps, I need to split text at a single character into two pieces, so if something that's similar to C's tokenizer would work, I can use that.

Jeffrey Aylesworth asked Dec 23 '09 23:12



2 Answers

You don't specify the encoding of the file. Since it is impossible to reliably determine a file's encoding automatically, it must be specified explicitly. If it isn't, Ruby falls back to the default external encoding (Encoding.default_external). If that hasn't been set explicitly, it is derived from the locale settings in the environment, and if the environment doesn't specify an encoding, the file is assumed to be 7-bit US-ASCII.

In your case, it seems that there is either a difference in the two environments (automated scripts are often run in a very restrictive environment without locale settings) or in the way the interpreter gets invoked.

So you'd need to do something like:

File.read('/path/to/file', encoding: 'UTF-8')
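
To tie this back to the split problem, here is a minimal sketch rather than the asker's actual script. It assumes the goal is to cut each line into two pieces at a single character. The helper name `split_pairs`, the sample file, and the `→` separator are all made up for illustration. It uses String#partition, which splits at the first occurrence only.

```ruby
# encoding: utf-8
require 'tmpdir'

# Hypothetical helper (not from the original script): read a file as UTF-8
# and cut each line into two pieces at the first occurrence of `separator`.
def split_pairs(path, separator)
  # The explicit encoding makes the result independent of the locale settings.
  text = File.read(path, encoding: 'UTF-8')
  text.each_line.map do |line|
    left, _sep, right = line.chomp.partition(separator)
    [left, right]
  end
end

# Demo with made-up data; binwrite stores the bytes exactly as given.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'symbols.txt')
  File.binwrite(path, "α→alpha\nβ→beta\n")
  split_pairs(path, '→').each { |left, right| puts "#{left} | #{right}" }
end
```

Because the encoding is passed to File.read directly, the strings are tagged as UTF-8 even when the script is started from an environment with no locale set, so partition (or split) should no longer see them as US-ASCII.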
Jörg W Mittag answered Nov 09 '22 04:11



Sounds like the two are being run from different environments, with different locale settings (for example, the LANG or LC_ALL environment variables).
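
One way to check this is to print what Ruby sees in each environment and compare. The following is only a sketch: the helper name `encoding_report` is invented here, and LC_ALL, LC_CTYPE and LANG are assumed to be the relevant variables on your system.

```ruby
# Hypothetical helper: collect the settings that influence how Ruby
# decodes files when no encoding is given explicitly.
def encoding_report
  internal = Encoding.default_internal
  {
    'LC_ALL'           => ENV['LC_ALL'],
    'LC_CTYPE'         => ENV['LC_CTYPE'],
    'LANG'             => ENV['LANG'],
    'default_external' => Encoding.default_external.name,
    'default_internal' => internal && internal.name
  }
end

# Print one setting per line; unset values show up as nil.
encoding_report.each { |key, value| puts "#{key}: #{value.inspect}" }
```

Run it once from the terminal and once from the Automator workflow. If default_external differs between the two, that would explain the error, and either passing an encoding to File.read or setting Encoding.default_external = Encoding::UTF_8 at the top of the script should make both behave the same.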

Paul Beckingham answered Nov 09 '22 03:11
