
Ruby CSV UTF8 encoding error while reading

Tags: ruby, csv

This is what I was doing:

csv = CSV.open(file_name, "r")

I used this for testing:

line = csv.shift
while not line.nil?
  puts line
  line = csv.shift
end

And I ran into this:

ArgumentError: invalid byte sequence in UTF-8

I read the answer here, and this is what I tried:

csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8")

I ran into the following error:

Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8

Then I came across a Ruby gem - charlock_holmes. I figured I'd try using it to find the source encoding.

CharlockHolmes::EncodingDetector.detect(File.read(file_name))
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}

So I did this:

csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8")

And still got this:

Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8
Vighnesh asked Apr 04 '13 21:04


1 Answer

It looks like you have a problem detecting the correct encoding of your file. CharlockHolmes gives you a useful hint with :confidence=>37, which simply means the detected encoding may not be the right one.

Based on the error messages and on test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb, I found an encoding that can convert both of the bytes named in your error messages. With String#encode this is easy to test:

"\x8F\x98".encode("UTF-8","cp1256") # => "ڈک"
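The same check can be run against several candidate encodings at once. The following is only a minimal sketch: the helper name convertible_encodings and the candidate list are hypothetical, chosen to cover the encodings mentioned in this thread. It keeps every candidate that converts the two offending bytes to UTF-8 without raising.

```ruby
# The two bytes reported in the question's error messages, as raw binary data.
bytes = "\x8F\x98".b

# Hypothetical helper: return the candidate encodings under which the bytes
# can be converted to UTF-8 without an EncodingError.
def convertible_encodings(bytes, candidates)
  candidates.select do |name|
    begin
      # Interpret the bytes as `name`, then attempt a strict conversion.
      bytes.dup.force_encoding(name).encode("UTF-8")
      true
    rescue EncodingError
      # Covers both invalid byte sequences and undefined conversions.
      false
    end
  end
end

candidates = %w[Windows-1250 Windows-1251 Windows-1252 Windows-1256]
result = convertible_encodings(bytes, candidates)
p result
```

A candidate that passes is only consistent with these two bytes. It is not proof that the file was actually written in that encoding, so the converted text still needs a look by eye.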

Your issue looks strictly related to the file, not to Ruby.

If we are not sure which encoding to use and can accept losing some characters, we can pass the :invalid and :undef options to String#encode. In this case:

"\x8F\x98".encode("UTF-8", "CP1250",:invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"
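Applied to the CSV problem from the question, the lossy conversion could look roughly like this. It is a sketch under assumptions: parse_csv_lossy is a hypothetical helper name, the sample data is made up, and CP1250 stands in for whatever source encoding you settle on. For a real file, the raw bytes would come from something like File.binread(file_name).

```ruby
require "csv"

# Hypothetical helper: interpret raw bytes as `source_encoding`, transcode to
# UTF-8 while replacing anything that cannot be converted, then parse as CSV.
def parse_csv_lossy(raw_bytes, source_encoding)
  utf8 = raw_bytes.dup.force_encoding(source_encoding).encode(
    "UTF-8", invalid: :replace, undef: :replace, replace: "?"
  )
  CSV.parse(utf8)
end

# Made-up sample data containing the two bytes from the error messages.
raw = "a,\x8F\x98\nb,c\n".b
rows = parse_csv_lossy(raw, "CP1250")
p rows
```

Every replaced byte becomes a "?", so it is worth counting them afterwards to see how much data was lost.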

Another way is to use Iconv's //IGNORE option on the target encoding:

require 'iconv'
Iconv.iconv("UTF-8//IGNORE", "CP1250", "\x8F\x98")

For the source encoding, the suggestion from CharlockHolmes should be a pretty good starting point.

PS. String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv.

chrmod answered Sep 18 '22 16:09