Parsing a CSV file using different encodings and libraries

Despite numerous SO threads on the topic, I'm having trouble with parsing CSV. It's a .csv file downloaded from the Adwords Keyword Planner. Previously, Adwords had the option of exporting data as 'plain CSV' (which could be parsed with the Ruby CSV library); now the options are either Adwords CSV or Excel CSV. BOTH of these formats cause this problem (illustrated by a terminal session):

file = File.open('public/uploads/testfile.csv')
 => #<File:public/uploads/testfile.csv> 

file.read.encoding
 => #<Encoding:UTF-8> 

require 'csv'
 => true 

CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8

Let's change the encoding and see if that helps:

file.close
 => nil 

file = File.open("public/uploads/testfile.csv", "r:ISO-8859-1")
 => #<File:public/uploads/testfile.csv> 

file.read.encoding
 => #<Encoding:ISO-8859-1> 

CSV.foreach(file) { |row| puts row }
ArgumentError: invalid byte sequence in UTF-8

Let's try using a different CSV library:

require 'smarter_csv'
 => true 

file.close
 => nil 

file = SmarterCSV.process('public/uploads/testfile.csv')
ArgumentError: invalid byte sequence in UTF-8

Is this a no-win situation? Do I have to roll my own CSV parser?

I'm using Ruby 1.9.3p374. Thanks!

UPDATE 1:

Using the suggestions in the comments, here's the current version:

file_contents = File.open("public/uploads/new-format/testfile-adwords.csv", 'rb').read

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

file_contents.gsub!(/\0/, '') #needed because otherwise, I get "string contains null byte (ArgumentError)"

CSV.foreach(file_contents, :headers => true, :header_converters => :symbol) do |row|
  puts row
end

This doesn't work - now I get a "file name too long" error.
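(For reference, the "file name too long" error happens because CSV.foreach expects a file *path*, so passing the file's entire contents makes Ruby try to open a file named after that whole string. A string already in memory goes through CSV.parse instead. A minimal sketch, using stand-in tab-separated data rather than the real AdWords columns:)

```ruby
require 'csv'

# Stand-in for file_contents after the encoding cleanup above --
# not the real AdWords export.
file_contents = "Campaign\tBudget\nBrand\t100\n"

# CSV.parse works on an in-memory string; CSV.foreach only takes a path.
rows = CSV.parse(file_contents, :col_sep => "\t",
                 :headers => true, :header_converters => :symbol)
rows.each { |row| puts row.inspect }
```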

asked Dec 21 '13 by fullstackplus

2 Answers

Looking at the file in question:

 $ curl -s http://jamesabbottdd.com/examples/testfile.csv | xxd | head -n3
0000000: fffe 4300 6100 6d00 7000 6100 6900 6700  ..C.a.m.p.a.i.g.
0000010: 6e00 0900 4300 7500 7200 7200 6500 6e00  n...C.u.r.r.e.n.
0000020: 6300 7900 0900 4200 7500 6400 6700 6500  c.y...B.u.d.g.e.

The byte order mark fffe at the start suggests the file encoding is little-endian UTF-16 (the bytes ff fe are the UTF-16LE BOM), and the 00 bytes at every other position back this up.

This would suggest that you should be able to do this:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le') do |row| ...

However that gives me invalid byte sequence in UTF-16LE (ArgumentError) coming from inside the CSV library. I think this is because IO#gets, as called inside CSV, only returns a single byte for some reason when it hits the BOM, resulting in invalid UTF-16.

You can get CSV to strip off the BOM by using bom|utf-16le as the encoding:

CSV.foreach('./testfile.csv', :encoding => 'bom|utf-16le') do |row| ...

You might prefer to convert the string to a more familiar encoding instead, in which case you could do:

CSV.foreach('./testfile.csv', :encoding => 'utf-16le:utf-8') do |row| ...

Both of these appear to work okay.
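Putting the pieces together, here is a self-contained sketch that generates a small BOM-prefixed, tab-separated UTF-16LE file like the one in the hex dump (the 0900 bytes are tabs) and parses it. The column names and file name are stand-ins, not the real AdWords headers:

```ruby
require 'csv'

# Write sample data as raw UTF-16LE bytes with a leading BOM (\uFEFF),
# mimicking the AdWords export shown in the hex dump above.
path = 'testfile-utf16le.csv'
File.binwrite(path, "\uFEFFCampaign\tBudget\nBrand\t100\n".encode('UTF-16LE'))

# 'bom|utf-16le:utf-8' consumes the BOM, decodes the rest as UTF-16LE,
# and transcodes the strings handed back to UTF-8, all in one step.
rows = []
CSV.foreach(path, :encoding => 'bom|utf-16le:utf-8', :col_sep => "\t") do |row|
  rows << row
end
puts rows.inspect
```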

answered Oct 05 '22 by matt


Converting the file to UTF8 first and then reading it also works nicely:

iconv -f utf-16 -t utf8 testfile.csv | ruby -rcsv -e 'CSV(STDIN).each {|row| puts row}'

Iconv seems to understand correctly that the file has a BOM at the start and strips it off when converting.
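The same conversion can be done in pure Ruby with String#encode, tagging the raw bytes as UTF-16LE and dropping the BOM character explicitly. A sketch with stand-in data (the real export is tab-separated, per the hex dump in the other answer):

```ruby
require 'csv'

# Stand-in sample file: BOM-prefixed UTF-16LE bytes, like the AdWords export.
path = 'testfile-utf16le.csv'
File.binwrite(path, "\uFEFFCampaign\tBudget\nBrand\t100\n".encode('UTF-16LE'))

# Read raw bytes, tag them as UTF-16LE, transcode to UTF-8,
# then strip the BOM character that survives the transcode.
raw  = File.binread(path)
utf8 = raw.force_encoding('UTF-16LE').encode('UTF-8').sub(/\A\uFEFF/, '')

rows = CSV.parse(utf8, :col_sep => "\t")
puts rows.inspect
```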

answered Oct 05 '22 by Tom De Leu