Despite numerous SO threads on the topic, I'm having trouble parsing CSV. It's a .csv file downloaded from the Adwords Keyword Planner. Previously, Adwords had the option of exporting data as 'plain CSV' (which could be parsed with the Ruby CSV library); now the options are either Adwords CSV or Excel CSV. BOTH of these formats cause this problem (illustrated by a terminal session):
file = File.open('public/uploads/testfile.csv')
=> #<File:public/uploads/testfile.csv>
file.read.encoding
=> #<Encoding:UTF-8>
require 'csv'
=> true
CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8
Let's change the encoding and see if that helps:
file.close
=> nil
file = File.open("public/uploads/testfile.csv", "r:ISO-8859-1")
=> #<File:public/uploads/testfile.csv>
file.read.encoding
=> #<Encoding:ISO-8859-1>
CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8
Let's try using a different CSV library:
require 'smarter_csv'
=> true
file.close
=> nil
file = SmarterCSV.process('public/uploads/testfile.csv')
ArgumentError: invalid byte sequence in UTF-8
Is this a no-win situation? Do I have to roll my own CSV parser?
I'm using Ruby 1.9.3p374. Thanks!
UPDATE 1:
Using the suggestions in the comments, here's the current version:
file_contents = File.open("public/uploads/new-format/testfile-adwords.csv", 'rb').read
require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
file_contents.encode!('UTF-8', 'UTF-16')
else
ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
file_contents = ic.iconv(file_contents)
end
file_contents.gsub!(/\0/, '') #needed because otherwise, I get "string contains null byte (ArgumentError)"
CSV.foreach(file_contents, :headers => true, :header_converters => :symbol) do |row|
puts row
end
This doesn't work - now I get a "file name too long" error.
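The "file name too long" error comes from the fact that CSV.foreach expects a file *path* as its first argument, so it treats the entire contents string as a filename. For an in-memory string, CSV.parse is the right call. A minimal sketch (the data here is hypothetical, standing in for the cleaned-up file_contents above):

```ruby
require 'csv'

# Hypothetical stand-in for the cleaned-up file_contents from the update.
file_contents = "Campaign,Currency,Budget\nTest,USD,10.00\n"

# CSV.parse works on a string; CSV.foreach wants a path, which is why
# passing the contents produced "file name too long".
rows = CSV.parse(file_contents, :headers => true, :header_converters => :symbol)
rows.each { |row| puts row[:campaign] }
```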
Looking at the file in question:
$ curl -s http://jamesabbottdd.com/examples/testfile.csv | xxd | head -n3
0000000: fffe 4300 6100 6d00 7000 6100 6900 6700 ..C.a.m.p.a.i.g.
0000010: 6e00 0900 4300 7500 7200 7200 6500 6e00 n...C.u.r.r.e.n.
0000020: 6300 7900 0900 4200 7500 6400 6700 6500 c.y...B.u.d.g.e.
The byte order mark fffe at the start suggests the file encoding is little-endian UTF-16, and the 00 bytes at every other position back this up.
This would suggest that you should be able to do this:
CSV.foreach('./testfile.csv', :encoding => 'utf-16le') do |row| ...
However that gives me invalid byte sequence in UTF-16LE (ArgumentError)
coming from inside the CSV library. I think this is because IO#gets, as called inside CSV, returns only a single byte when it first hits the BOM, resulting in invalid UTF-16.
You can get CSV to strip off the BOM by using bom|utf-16le as the encoding:
CSV.foreach('./testfile.csv', :encoding => 'bom|utf-16le') do |row| ...
You might prefer to convert the string to a more familiar encoding instead, in which case you could do:
CSV.foreach('./testfile.csv', :encoding => 'utf-16le:utf-8') do |row| ...
Both of these appear to work okay.
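You can also combine the two, stripping the BOM and transcoding in one encoding string. I haven't tested this against the real Adwords export, so treat it as a sketch; the file path and data below are hypothetical, with the sample file built in UTF-16LE to mimic the export:

```ruby
require 'csv'
require 'tmpdir'

# Build a small UTF-16LE file with a BOM, mimicking the Adwords export
# (hypothetical path and data).
path = File.join(Dir.tmpdir, 'testfile-utf16le.csv')
File.open(path, 'wb') do |f|
  # "\uFEFF" encoded to UTF-16LE produces the ff fe BOM bytes.
  f.write "\uFEFFCampaign,Budget\nTest,10.00\n".encode('UTF-16LE')
end

# 'bom|utf-16le:utf-8' asks IO to strip the BOM *and* transcode each line
# to UTF-8, so the parsed rows come back as ordinary UTF-8 strings.
rows = []
CSV.foreach(path, :encoding => 'bom|utf-16le:utf-8') { |row| rows << row }
puts rows.inspect
```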
Converting the file to UTF-8 first and then reading it also works nicely:
iconv -f utf-16 -t utf8 testfile.csv | ruby -rcsv -e 'CSV(STDIN).each {|row| puts row}'
Iconv seems to understand correctly that the file has a BOM at the start and strips it off when converting.
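If you'd rather stay in Ruby than shell out to iconv, the same one-shot conversion can be sketched as follows (the file path and data are hypothetical; the sample file is written in UTF-16LE with a BOM to stand in for the real export):

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical stand-in for the Adwords export.
path = File.join(Dir.tmpdir, 'testfile-utf16le-convert.csv')
File.open(path, 'wb') do |f|
  f.write "\uFEFFCampaign,Budget\nTest,10.00\n".encode('UTF-16LE')
end

# Pure-Ruby equivalent of the iconv pipeline: 'bom|utf-16le' consumes the
# BOM on read, and String#encode transcodes the whole thing to UTF-8 before
# parsing.
utf8 = File.read(path, :mode => 'rb:bom|utf-16le').encode('utf-8')
rows = CSV.parse(utf8)
rows.each { |row| puts row.inspect }
```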