Parsing a CSV file using different encodings and libraries

Despite numerous SO threads on the topic, I'm having trouble with parsing CSV. It's a .csv file downloaded from the Adwords Keyword Planner. Previously, Adwords had the option of exporting data as 'plain CSV' (which could be parsed with the Ruby CSV library); now the options are either Adwords CSV or Excel CSV. BOTH of these formats cause this problem (illustrated by a terminal session):

file = File.open('public/uploads/testfile.csv')
 => #<File:public/uploads/testfile.csv> 

file.read.encoding
 => #<Encoding:UTF-8> 

require 'csv'
 => true 

CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8

Let's change the encoding and see if that helps:

file.close
 => nil 

file = File.open("public/uploads/testfile.csv", "r:ISO-8859-1")
 => #<File:public/uploads/testfile.csv> 

file.read.encoding
 => #<Encoding:ISO-8859-1> 

CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8

Let's try using a different CSV library:

require 'smarter_csv'
 => true 

file.close
 => nil 

file = SmarterCSV.process('public/uploads/testfile.csv')
ArgumentError: invalid byte sequence in UTF-8

Is this a no-win situation? Do I have to roll my own CSV parser?

I'm using Ruby 1.9.3p374. Thanks!

UPDATE 1:

Using the suggestions in the comments, here's the current version:

file_contents = File.open("public/uploads/new-format/testfile-adwords.csv", 'rb').read

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

file_contents.gsub!(/\0/, '') #needed because otherwise, I get "string contains null byte (ArgumentError)"

CSV.foreach(file_contents, :headers => true, :header_converters => :symbol) do |row|
  puts row
end

This doesn't work - now I get a "file name too long" error.
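(For reference, the "file name too long" error happens because CSV.foreach expects a file *path*, so passing the file's entire contents makes Ruby try to open a file named after that whole string. A string already in memory goes through CSV.parse instead. A minimal sketch, using stand-in tab-separated data rather than the real AdWords columns:)

```ruby
require 'csv'

# Stand-in for file_contents after the encoding cleanup above --
# not the real AdWords export.
file_contents = "Campaign\tBudget\nBrand\t100\n"

# CSV.parse works on an in-memory string; CSV.foreach only takes a path.
rows = CSV.parse(file_contents, :col_sep => "\t",
                 :headers => true, :header_converters => :symbol)
rows.each { |row| puts row.inspect }
```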

asked Dec 21 '13 by fullstackplus

2 Answers

Looking at the file in question:

 $ curl -s http://jamesabbottdd.com/examples/testfile.csv | xxd | head -n3
0000000: fffe 4300 6100 6d00 7000 6100 6900 6700  ..C.a.m.p.a.i.g.
0000010: 6e00 0900 4300 7500 7200 7200 6500 6e00  n...C.u.r.r.e.n.
0000020: 6300 7900 0900 4200 7500 6400 6700 6500  c.y...B.u.d.g.e.

The byte order mark fffe at the start suggests the file encoding is little-endian UTF-16 (the bytes ff fe are the UTF-16LE BOM), and the 00 bytes at every other position back this up.

This would suggest that you should be able to do this:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le') do |row| ...

However that gives me invalid byte sequence in UTF-16LE (ArgumentError) coming from inside the CSV library. I think this is because IO#gets, as called inside CSV, only returns a single byte for some reason when it hits the BOM, resulting in invalid UTF-16.

You can get CSV to strip off the BOM by using bom|utf-16le as the encoding:

CSV.foreach('./testfile.csv', :encoding => 'bom|utf-16le') do |row| ...

You might prefer to convert the string to a more familiar encoding instead, in which case you could do:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le:utf-8') do |row| ...

Both of these appear to work okay.
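Putting the pieces together, here is a self-contained sketch that generates a small BOM-prefixed, tab-separated UTF-16LE file like the one in the hex dump (the 0900 bytes are tabs) and parses it. The column names and file name are stand-ins, not the real AdWords headers:

```ruby
require 'csv'

# Write sample data as raw UTF-16LE bytes with a leading BOM (\uFEFF),
# mimicking the AdWords export shown in the hex dump above.
path = 'testfile-utf16le.csv'
File.binwrite(path, "\uFEFFCampaign\tBudget\nBrand\t100\n".encode('UTF-16LE'))

# 'bom|utf-16le:utf-8' consumes the BOM, decodes the rest as UTF-16LE,
# and transcodes the strings handed back to UTF-8, all in one step.
rows = []
CSV.foreach(path, :encoding => 'bom|utf-16le:utf-8', :col_sep => "\t") do |row|
  rows << row
end
puts rows.inspect
```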

answered Oct 05 '22 by matt


Converting the file to UTF8 first and then reading it also works nicely:

iconv -f utf-16 -t utf8 testfile.csv | ruby -rcsv -e 'CSV(STDIN).each {|row| puts row}'

Iconv seems to understand correctly that the file has a BOM at the start and strips it off when converting.
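The same conversion can be done in pure Ruby with String#encode, tagging the raw bytes as UTF-16LE and dropping the BOM character explicitly. A sketch with stand-in data (the real export is tab-separated, per the hex dump in the other answer):

```ruby
require 'csv'

# Stand-in sample file: BOM-prefixed UTF-16LE bytes, like the AdWords export.
path = 'testfile-utf16le.csv'
File.binwrite(path, "\uFEFFCampaign\tBudget\nBrand\t100\n".encode('UTF-16LE'))

# Read raw bytes, tag them as UTF-16LE, transcode to UTF-8,
# then strip the BOM character that survives the transcode.
raw  = File.binread(path)
utf8 = raw.force_encoding('UTF-16LE').encode('UTF-8').sub(/\A\uFEFF/, '')

rows = CSV.parse(utf8, :col_sep => "\t")
puts rows.inspect
```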

answered Oct 05 '22 by Tom De Leu