
Parse CSV with commas, double quotes and encoding

I'm using Ruby 1.9 to parse the following CSV file, which contains MacRoman characters:

# encoding: ISO-8859-1
#csv_parse.csv
Name, main-dialogue
"Marceu", "Give it to him ó he, his wife."

I did the following to parse this.

require 'csv'
input_string = File.read("../csv_parse.rb").force_encoding("ISO-8859-1").encode("UTF-8")
 #=> "Name, main-dialogue\r\n\"Marceu\", \"Give it to him  \x97 he, his wife.\"\r\n"

data = CSV.parse(input_string, :quote_char => "'", :col_sep => "/\",/")
 #=> [["Name, main-dialogue"], ["\"Marceu", " \"Give it to him  \x97 he, his wife.\""]]

So the problem is that the second array in data is not the 2 strings I expect, like: ["\"Marceu\"", " \"Give it to him \x97 he, his wife.\""]. I also tried :col_sep => "," (which is the default behaviour), but that gave me 3 splits.

header = CSV.parse(input_string, :quote_char => "'")[0].map{|a| a.strip.downcase unless a.nil? }
 #=> ["Name", "main-dialogue"]

I have to parse a second time for the header, since there are no double quotes in that line.

The output is intended to be shown in a browser, so the character ó should show up as usual and not as \x97 or anything else.

Is there any way to solve the above problems?

zoras asked Dec 22 '11 10:12



1 Answer

I think you do have MacRoman encoded data; if you do this in irb:

>> "\x97".force_encoding('MacRoman').encode('UTF-8')

you get this:

=> "ó"

And that seems to be the character that you're expecting. So you want this:

input_string = File.read("../csv_parse.rb").force_encoding('MacRoman').encode('UTF-8')
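To see why the encoding label matters, here is a small sketch that decodes the same byte under both labels. The variable names are only illustrative; the point is that ISO-8859-1 maps 0x97 to an invisible control character (U+0097), while MacRoman maps it to ó.

```ruby
# The single byte from the file, with no encoding assumed yet.
byte = "\x97".b

# Labelled as ISO-8859-1, 0x97 becomes the C1 control character U+0097.
as_latin1 = byte.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Labelled as MacRoman, the same byte becomes "ó" (U+00F3).
as_macroman = byte.dup.force_encoding('MacRoman').encode('UTF-8')

puts as_latin1.inspect
puts as_macroman.inspect
```

force_encoding only changes the label on the bytes; encode is what actually converts them to UTF-8.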

Then you have two columns in your CSV, the fields are quoted with double quotes (which is the default, so you don't need :quote_char), and the delimiter is ', ' (a comma followed by a space), so this should work:

data = CSV.parse(input_string, :col_sep => ", ")

and data will look like this:

[
    ["Name", "main-dialogue"],
    ["Marceu", "Give it to him  ó he, his wife."]
]
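Putting the pieces together, here is a minimal, self-contained sketch. It assumes the file's bytes match the string shown in the question, so it builds them in memory instead of reading from disk; with a real file you would use File.read as above. The header handling at the end mirrors the strip/downcase step from the question.

```ruby
require 'csv'

# Raw MacRoman bytes as they would come off disk (0x97 is "ó" in MacRoman).
raw = "Name, main-dialogue\r\n\"Marceu\", \"Give it to him  \x97 he, his wife.\"\r\n".b

# Label the bytes as MacRoman, then transcode to UTF-8.
input_string = raw.force_encoding('MacRoman').encode('UTF-8')

# Double quotes are the default quote character, so only the
# two-character separator ", " has to be specified.
data = CSV.parse(input_string, :col_sep => ", ")

# The first row is the header; normalise it as in the question.
header = data.first.map { |a| a.strip.downcase unless a.nil? }
rows   = data.drop(1)

puts header.inspect
puts rows.inspect
```

Because the comma inside the second field sits between double quotes, it is treated as data and not as a separator, so a single parse covers both the header and the rows.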
mu is too short answered Sep 27 '22 21:09
