I am having trouble changing the encoding of a text file in Ruby 1.9.2p290. I get the error "invalid byte sequence in UTF-8 (ArgumentError)". I think the problem is that the file's charset seems to be unknown.
From the command line if I do the following:
$ file test.txt
I get:
Non-ISO extended-ASCII English text, with CRLF line terminators
Or, alternatively, if I do:
$ file -i test.txt
I get:
test.txt: text/plain; charset=unknown
However, in Ruby if I do:
data = File.open("test.txt").read
puts data.encoding.name
puts data.valid_encoding?
I get:
UTF-8
false
Here's a simplified snippet of my code:
data = File.open("test.txt").read
data.encode!("UTF-8")
data.each_line do |line|
newfile_data << line
end
In Ruby 1.9 every stream has two encodings associated with it: an external encoding and an internal encoding. The external encoding is the encoding of the text as it is stored in the stream (in your case, the encoding of the file). The internal encoding is the encoding you want the text converted to when it is read from the stream.
If you do not set an external/internal encoding for the stream, the default external/internal encoding of the process is used. If no internal encoding is specified, strings read from the stream are tagged (not converted) with the external encoding (the same as String#force_encoding).
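The difference between tagging and converting can be sketched as follows. This is a minimal illustration, not part of the original answer: the sample bytes (the word "café" with é stored as the single byte 0xE9) and the variable names are made up for the example.

```ruby
# Hypothetical sample: "caf" followed by byte 0xE9, which is "é" in
# Windows-1252 but is not a valid sequence on its own in UTF-8.
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*")   # binary (ASCII-8BIT) string

# Tagging: force_encoding changes only the encoding label, never the bytes.
tagged = raw.dup.force_encoding("UTF-8")
puts tagged.bytesize          # still 4 bytes
puts tagged.valid_encoding?   # false, the bytes are not valid UTF-8

# Converting: label the bytes with their real encoding, then transcode.
converted = raw.dup.force_encoding("Windows-1252").encode("UTF-8")
puts converted.bytesize         # 5, because "é" takes two bytes in UTF-8
puts converted.valid_encoding?  # true
```

Reading a file without an internal encoding behaves like the tagging case: the bytes are kept as they are and only labeled.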
Most likely you have:
Encoding::default_external # => #<Encoding:UTF-8>
Encoding::default_internal # => nil
Your file, however, is encoded in some ASCII-based extended character set, NOT in UTF-8.
Your Ruby code reads the sequence of bytes from the external source into a string tagged as UTF-8. Because the file contains "Non-ISO extended-ASCII English text", some of those bytes are not valid UTF-8, so you get data.valid_encoding? # => false.
You need to set the external encoding of your stream to the encoding of the file. For example, if you have a file in cp1251 (windows-1251) encoding containing the text "файл", then you need to read it with the following code:
data = File.open("test.txt", 'r:windows-1251').read
puts data.encoding.name # => Windows-1251
puts data.valid_encoding? # => true
or even specify both internal and external encoding:
data = File.open("test.txt", 'r:windows-1251:utf-8').read
puts data.encoding.name # => UTF-8
puts data.valid_encoding? # => true
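To try the two read modes without a real cp1251 file, here is a self-contained sketch. It is an illustration under assumptions: the temp file, its name prefix and the variable names are made up for the example, and the word "файл" is written with \u escapes so the script does not depend on the source file's own encoding.

```ruby
require "tempfile"

word = "\u0444\u0430\u0439\u043B"   # the word from the example above, as UTF-8

# Write the word to a temp file as raw Windows-1251 bytes (one byte per letter).
file = Tempfile.new("encoding_demo")
file.binmode
file.write(word.encode("Windows-1251").force_encoding("BINARY"))
file.close
path = file.path

# External encoding only: the string is tagged as Windows-1251, bytes unchanged.
external_only = File.open(path, "r:windows-1251") { |f| f.read }
puts external_only.encoding.name   # Windows-1251
puts external_only.bytesize        # 4

# External and internal encoding: the data is transcoded to UTF-8 on read.
transcoded = File.open(path, "r:windows-1251:utf-8") { |f| f.read }
puts transcoded.encoding.name      # UTF-8
puts transcoded == word            # true

# Wrong external encoding: the same bytes labeled as UTF-8 are invalid.
mislabeled = File.open(path, "r:utf-8") { |f| f.read }
puts mislabeled.valid_encoding?    # false

file.unlink
```

The last read reproduces the situation in the question: nothing fails at read time, but the resulting string is not valid in the encoding it is tagged with.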
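Alternatively, assuming the file is actually in windows-1252, you can read it with that encoding, convert it to UTF-8 and normalize the CRLF line endings:
data = IO.read("test.txt", :encoding => 'windows-1252')
data = data.encode("UTF-8").gsub("\r\n", "\n")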
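Since `file` reports the charset as unknown, windows-1252 is only a guess, and the file may contain bytes that have no mapping in that encoding. The sketch below is a hedged variant of the snippet above that substitutes a placeholder for such bytes; it assumes that losing those characters is acceptable. The sample bytes, the temp file and the "?" replacement are made up for the example.

```ruby
require "tempfile"

# Hypothetical sample: "caf", byte 0xE9 ("é" in Windows-1252), CRLF,
# then byte 0x81 (unassigned in Windows-1252), CRLF.
sample_bytes = [0x63, 0x61, 0x66, 0xE9, 0x0D, 0x0A, 0x81, 0x0D, 0x0A].pack("C*")

file = Tempfile.new("crlf_demo")
file.binmode
file.write(sample_bytes)
file.close

# "rb" keeps the bytes untouched; the string is tagged as Windows-1252.
raw = File.open(file.path, "rb:windows-1252") { |f| f.read }

# Transcode to UTF-8, replacing anything that cannot be converted with "?"
# (assumption: dropping such characters is acceptable for this data).
cleaned = raw.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")

# Normalize Windows line endings.
cleaned = cleaned.gsub("\r\n", "\n")

puts cleaned.encoding.name     # UTF-8
puts cleaned.valid_encoding?   # true

file.unlink
```

Without the :invalid and :undef options, encode raises an exception on the first byte it cannot convert, which is often the better behavior while you are still working out the file's real encoding.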