
Changing character encoding

I am having problems changing the encoding of a text file in Ruby 1.9.2p290. I am getting the error "invalid byte sequence in UTF-8 (ArgumentError)". I think the problem is that the file's charset seems to be unknown.

From the command line if I do the following:

$ file test.txt

I get:

Non-ISO extended-ASCII English text, with CRLF line terminators

Or, alternatively, if I do:

$ file -i test.txt 

I get:

test.txt: text/plain; charset=unknown

However, in Ruby if I do:

data = File.open("test.txt").read

puts data.encoding.name

puts data.valid_encoding?

I get:

UTF-8
false

Here's a simplified snippet of my code:

data = File.open("test.txt").read

data.encode!("UTF-8")

data.each_line do |line|

  newfile_data << line

end

thilton asked Dec 22 '11 21:12


2 Answers

In Ruby 1.9 every stream has two encodings associated with it: an external encoding and an internal encoding. The external encoding is the encoding of the text as it is stored in the stream (in your case, the encoding of the file). The internal encoding is the encoding you want the text to have once it has been read from the stream.

If you do not set an external/internal encoding for the stream, the default external/internal encoding of the process is used. If no internal encoding is specified, strings read from the stream are tagged (not converted) with the external encoding, the same as with String#force_encoding.
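The difference between tagging and converting can be sketched without touching a file. This is only an illustration: the sample bytes (the word "café" as a single-byte extended-ASCII file would store it) and the choice of Windows-1252 as the source encoding are assumptions, not something taken from the question's file.

```ruby
# Bytes for "café" as a Windows-1252 file would store them: 0xE9 is "é"
# there, but a lone 0xE9 is not a valid UTF-8 sequence.
bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")   # binary (ASCII-8BIT) string

# Tagging: only the encoding label changes, the bytes stay the same.
tagged = bytes.dup.force_encoding("UTF-8")
puts tagged.valid_encoding?     # false
puts tagged.bytesize            # 4

# Converting: declare the real source encoding, then transcode to UTF-8.
converted = bytes.dup.force_encoding("Windows-1252").encode("UTF-8")
puts converted.valid_encoding?  # true
puts converted.bytesize         # 5 ("é" takes two bytes in UTF-8)
```

Reading a file with only an external encoding behaves like the first case; adding an internal encoding behaves like the second.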

Most likely you have:

Encoding::default_external # => #<Encoding:UTF-8>
Encoding::default_internal # => nil

Your file, however, is encoded in some ASCII-based extended character encoding, NOT in UTF-8. Your Ruby code reads the raw bytes from the file into a string that is merely tagged as UTF-8. Because the file contains non-ISO extended-ASCII text, some of those bytes are not valid UTF-8, which is why you get data.valid_encoding? # => false.

You need to set the external encoding of your stream to the actual encoding of the file. For example, if you have a file in CP1251 (windows-1251) encoding containing the text файл, you need to read it with the following code:

data = File.open("test.txt", 'r:windows-1251').read
puts data.encoding.name    # => Windows-1251
puts data.valid_encoding?  # => true

or even specify both the external and the internal encoding, in which case the text is converted while it is read:

data = File.open("test.txt", 'r:windows-1251:utf-8').read
puts data.encoding.name    # => UTF-8
puts data.valid_encoding?  # => true
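
To try both read modes without having a CP1251 file at hand, here is a small self-contained sketch. The temporary file and its name prefix are made up for the demonstration, and the word файл is written with \u escapes so the snippet does not depend on the encoding of the script itself.

```ruby
require "tempfile"

text = "\u0444\u0430\u0439\u043B"   # "файл" as a UTF-8 string

tmp = Tempfile.new("cp1251-sample")  # hypothetical throwaway file
begin
  tmp.binmode
  tmp.write(text.encode("Windows-1251"))  # store it as CP1251 bytes
  tmp.close

  # External encoding only: the string is tagged, bytes are untouched.
  tagged = File.open(tmp.path, "r:windows-1251") { |f| f.read }
  # External and internal encoding: transcoded to UTF-8 while reading.
  converted = File.open(tmp.path, "r:windows-1251:utf-8") { |f| f.read }

  puts tagged.encoding.name     # Windows-1251
  puts tagged.bytesize          # 4 (one byte per letter)
  puts converted.encoding.name  # UTF-8
  puts converted.bytesize       # 8 (two bytes per letter)
  puts converted == text        # true
ensure
  tmp.unlink
end
```

Passing a block to File.open also closes the file handle as soon as the read is done.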

Aliaksei Kliuchnikau answered Oct 02 '22 19:10


data = IO.read("test.txt", :encoding => 'windows-1252')
data = data.encode("UTF-8").gsub("\r\n", "\n")
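
The two lines above assume the file really is Windows-1252, which the file command could not confirm, so treat that as a guess to verify against your data. A slightly more defensive sketch follows; read_as_utf8 is a hypothetical helper name, and the :invalid and :undef replacement options are my addition so that bytes Windows-1252 leaves undefined become U+FFFD instead of raising.

```ruby
require "tempfile"

# Hypothetical helper (not from the answer): reads a file in the given
# source encoding and returns UTF-8 text with LF line endings.
def read_as_utf8(path, source_encoding = "Windows-1252")
  data = File.open(path, "rb") { |f| f.read }  # raw bytes, CRLF preserved
  data.force_encoding(source_encoding)         # declare what the bytes are
  data.encode("UTF-8", :invalid => :replace, :undef => :replace)
      .gsub("\r\n", "\n")
end

# Demo with a throwaway file: "café" and "ok", each ending in CRLF.
tmp = Tempfile.new("cp1252-sample")
begin
  tmp.binmode
  tmp.write([0x63, 0x61, 0x66, 0xE9, 0x0D, 0x0A,
             0x6F, 0x6B, 0x0D, 0x0A].pack("C*"))
  tmp.close

  result = read_as_utf8(tmp.path)
  puts result.encoding.name          # UTF-8
  puts result.valid_encoding?        # true
  puts result == "caf\u00E9\nok\n"   # true
ensure
  tmp.unlink
end
```

If the converted text shows the wrong accented characters, the source encoding guess was wrong and another one should be tried.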

sunkencity answered Oct 02 '22 18:10