
How to use :replace, :invalid and :undef args for encoding using CSV.read?

Tags:

ruby

csv

encoding

Since Ruby 1.9, CSV can transcode data while parsing when you use methods such as ::foreach, ::open, ::read, and ::readlines.

For example: CSV.read('path/to/file', encoding: "windows-1252:UTF-8") tries to read a file in windows-1252 and returns an array with utf-8 encoded strings.
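To make that concrete, here is a minimal sketch of the round trip. The temp file, its contents, and the helper name read_cp1252_as_utf8 are made up for illustration; only CSV.read and the :encoding option come from the text above.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: read a Windows-1252 CSV and get UTF-8 fields back.
def read_cp1252_as_utf8(path)
  CSV.read(path, encoding: 'windows-1252:UTF-8')
end

Tempfile.create(['sample', '.csv']) do |f|
  f.binmode
  # In Windows-1252, byte 0xE9 is "é" and byte 0xE1 is "á".
  f.write("name,city\nJos\xE9,M\xE1laga\n".b)
  f.flush

  rows = read_cp1252_as_utf8(f.path)
  p rows                 # => [["name", "city"], ["José", "Málaga"]]
  p rows[1][0].encoding  # the fields come back tagged as UTF-8
end
```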

If the conversion between the two charsets hits a character that is undefined in one of them, it raises an Encoding::UndefinedConversionError.

The String#encode method has some nice options to deal with these undefined characters:

str = str.encode('UTF-8', invalid: :replace, undef: :replace, replace: "" )
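For reference, a small sketch of what those options do on a plain string, independent of CSV. The sample bytes and the helper name to_utf8_lossy are assumptions made for the example; it relies on byte 0x81 having no character assigned in Windows-1252.

```ruby
# "\xE9" is "é" in Windows-1252; "\x81" has no character assigned there.
raw = "caf\xE9 \x81".b.force_encoding('Windows-1252')

# Hypothetical helper wrapping the call shown above.
def to_utf8_lossy(str)
  str.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

begin
  raw.encode('UTF-8') # strict conversion, no replacement options
rescue Encoding::UndefinedConversionError => e
  puts "strict conversion failed: #{e.message}"
end

p to_utf8_lossy(raw) # => "café " (the undefined byte is dropped)
```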

Is there a way to apply this kind of replacement rule to undefined conversions when reading with the CSV parser?

Thank you.

Andres M asked Feb 24 '16 14:02


2 Answers

There is, indeed, a way. The trick is to define a custom converter that does the conversion you want using String#encode. When :encoding names only the source encoding, CSV leaves each field in that encoding and hands it to the converter, so the converter is where the transcoding happens. We pass the custom converter to CSV.read as the :converters option, along with the original :encoding:

UTF8_CONVERTER = ->(field) { field.encode('utf-8', invalid: :replace, undef: :replace, replace: "") }

CSV.read('foo.csv', encoding: 'windows-1252', converters: UTF8_CONVERTER)
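Here is a self-contained sketch of that approach against a Windows-1252 file containing a byte that cannot be converted. The file contents and the names SAFE_UTF8_CONVERTER and read_cp1252_lossy are invented for the example. The safe-navigation call is a precaution I am assuming is worthwhile: an unquoted empty field may reach a field converter as nil.

```ruby
require 'csv'
require 'tempfile'

# Same idea as UTF8_CONVERTER above; &. lets nil fields pass through untouched.
# (SAFE_UTF8_CONVERTER is a made-up name for this sketch.)
SAFE_UTF8_CONVERTER = lambda do |field|
  field&.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Hypothetical helper: read Windows-1252, drop whatever cannot be converted.
def read_cp1252_lossy(path)
  CSV.read(path, encoding: 'windows-1252', converters: SAFE_UTF8_CONVERTER)
end

Tempfile.create(['sample', '.csv']) do |f|
  f.binmode
  # 0xE9 is "é"; 0x81 has no character assigned in Windows-1252.
  f.write("name,note\nJos\xE9,ok\x81\n".b)
  f.flush
  p read_cp1252_lossy(f.path) # => [["name", "note"], ["José", "ok"]]
end
```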

Every character in Windows-1252 also exists in UTF-8 (only the handful of unassigned bytes, such as 0x81, can fail to convert), so I'll demonstrate the other way around. Suppose you have this UTF-8 CSV file:

foo,bar
yes👍,no💩

And suppose I want to convert it to ASCII-8BIT (because reasons?). This gives me an error:

CSV.read('emoji.csv', encoding: 'utf-8:ascii-8bit')
# => Encoding::UndefinedConversionError: U+1F44D from UTF-8 to ASCII-8BIT

But if I define a custom converter that replaces those undefined characters, it works perfectly:

ASCII_CONVERTER = ->(field) { field.encode('ascii-8bit', undef: :replace, replace: "@") }

CSV.read('emoji.csv', encoding: 'utf-8', converters: ASCII_CONVERTER)
# => [ [ "foo",  "bar"   ],
#      [ "yes@", "no@"] ]

(Note that encoding: 'utf-8' isn't strictly necessary here when UTF-8 is your default external encoding, as it usually is, but it will be necessary if your file has a different encoding.)

Jordan Running answered Nov 15 '22 06:11


If you want to use the replace behavior of String#encode, you will either have to encode the whole file content with it or do it line by line. Either way, every character that gets replaced is lost.

This is one way of doing it though:

require 'csv'

# The block form closes the file automatically.
File.open('path/to/file.csv') do |file|
  file.each_line do |line|
    # keep in mind that the first parameter here is the destination encoding,
    # the second is the source encoding
    sanitized_line = line.encode('UTF-8', 'windows-1252', invalid: :replace, undef: :replace, replace: '')
    fields_array = CSV.parse_line(sanitized_line)
    # do whatever you want with the fields you extracted
  end
end
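The other option mentioned above, encoding the whole file content in one go, could look roughly like this. The helper name parse_cp1252_lossy and the sample data are assumptions for the sketch. It reads the entire file into memory, so it suits small files only.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: transcode the whole file content first, then parse it.
def parse_cp1252_lossy(path)
  raw = File.binread(path) # raw bytes, no transcoding yet
  text = raw.encode('UTF-8', 'windows-1252',
                    invalid: :replace, undef: :replace, replace: '')
  CSV.parse(text)
end

Tempfile.create(['sample', '.csv']) do |f|
  f.binmode
  # A quoted field spanning two lines, plus the unassigned byte 0x81.
  f.write("id,comment\n1,\"two\nlines \x81\"\n".b)
  f.flush
  p parse_cp1252_lossy(f.path) # => [["id", "comment"], ["1", "two\nlines "]]
end
```

One reason to prefer this over going line by line: a quoted field that contains a line break stays in one piece, whereas CSV.parse_line only ever sees part of it.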

If your conversion from one encoding to another is pretty much guaranteed not to lose information (iso-8859-1 to utf-8, for example), I would really recommend simply converting the file on reading.

Another thing to keep in mind is that Ruby does not try to figure out the encoding of a file you are reading on its own. If you omit the parameter, it simply uses the default external and internal encodings. So you have to specify the file's encoding yourself. Ruby has no really reliable way of detecting it, so in my case I ended up doing this (on an Ubuntu system):

require 'csv'
require 'shellwords'
encoding = `file --brief --mime-encoding #{Shellwords.escape(path_to_file)}`.strip
arr_of_arrs = CSV.read(path_to_file, encoding: "#{encoding}:utf-8")
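If shelling out to file is not an option, a rough pure-Ruby fallback is sketched below. It is a heuristic, not real detection: it only checks whether the bytes form valid UTF-8 and otherwise assumes a fallback encoding that you choose. The names guess_encoding and read_csv_guessing, and the Windows-1252 default, are my own assumptions.

```ruby
require 'csv'
require 'tempfile'

# Heuristic only: bytes that form valid UTF-8 are treated as UTF-8;
# anything else is ASSUMED to be the fallback (Windows-1252 here).
def guess_encoding(path, fallback: 'windows-1252')
  bytes = File.binread(path)
  bytes.force_encoding(Encoding::UTF_8).valid_encoding? ? 'UTF-8' : fallback
end

# Hypothetical wrapper: read with the guessed source encoding, return UTF-8.
def read_csv_guessing(path)
  source = guess_encoding(path)
  if source == 'UTF-8'
    CSV.read(path, encoding: 'UTF-8')
  else
    CSV.read(path, encoding: "#{source}:UTF-8")
  end
end

Tempfile.create(['sample', '.csv']) do |f|
  f.binmode
  f.write("name\nJos\xE9\n".b) # a lone 0xE9 is not valid UTF-8
  f.flush
  puts guess_encoding(f.path)  # prints "windows-1252"
  p read_csv_guessing(f.path)  # => [["name"], ["José"]]
end
```

This cannot tell single-byte encodings apart (iso-8859-1 and windows-1252 look the same to it), so treat the fallback as a guess that fits your data source.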
KMoschcau answered Nov 15 '22 07:11