Why do CSV::HeaderConverters stop processing when a non-String is returned?

Tags: ruby, csv

Why does processing of header converters stop with the first non-String that's returned from a header converter?

Details

After the built-in :symbol header converter runs, no other converters are processed. It seems that processing of header converters stops with the first converter that returns anything that's not a String (i.e. you get the same behavior if you write a custom header converter that returns an Integer, or anything else).


This code works as expected, raising the exception in :throw_an_exception:

require 'csv'

CSV::HeaderConverters[:throw_an_exception] = lambda do |header|
  raise 'Exception triggered.'
end

csv_str = "Numbers\n" +
          "1\n" +
          "4\n" +
          "7"

# Options are passed as keyword arguments; Ruby 3+ rejects a positional options Hash here.
puts CSV.parse(
  csv_str,
  headers: true,
  header_converters: [
    :throw_an_exception,
    :symbol
  ]
)

However, if you switch the order of the header converters so that the :symbol converter comes first, the :throw_an_exception lambda is never called.

...

header_converters: [
  :symbol,
  :throw_an_exception
]

...
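
To see the short-circuit without relying on an exception, here is a minimal sketch that swaps the raising converter for one that records each header it receives. The record_call lambda and the calls array are illustrative names, not part of the CSV library, and the keyword-argument call style assumes a reasonably recent Ruby.

```ruby
require 'csv'

# Illustrative only: records every header this converter is asked to handle.
calls = []
record_call = lambda do |header|
  calls << header
  header # returned unchanged, so it is still a String
end

csv_str = "Numbers\n1\n4\n7"

# Recorder first: it sees the raw String header before :symbol converts it.
CSV.parse(csv_str, headers: true, header_converters: [record_call, :symbol])
calls_when_first = calls.dup

# Recorder second: :symbol returns a Symbol, so the recorder is skipped.
calls.clear
CSV.parse(csv_str, headers: true, header_converters: [:symbol, record_call])
calls_when_last = calls.dup

puts "recorder first: #{calls_when_first.inspect}"
puts "recorder last:  #{calls_when_last.inspect}"
```

With the recorder listed first it receives "Numbers"; with :symbol listed first the recorder's list stays empty.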
jefflunt asked Oct 18 '22 06:10



1 Answer

So I reached out to JEG2 (James Edward Gray II, who wrote Ruby's CSV library).

I was thinking that converters were intended to be a series of steps in a chain, where every element was supposed to go through every step. In fact, that's not the best way to use the CSV library, especially if you have a very large amount of data.

The way it should be used (and this answers the "why" question and explains why it's better for performance) is to have the converters work like a series of matchers. The first converter that matches returns a non-String, which tells the CSV library that the current value has been converted successfully. The parser can then stop as soon as the value is a non-String and move on to the next header/cell value.
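
The matcher behavior can be modeled in a few lines. This is a simplified sketch, not the CSV library's actual source; apply_converters, to_integer, and shout are hypothetical names made up for the illustration.

```ruby
# Simplified model of the short-circuit; NOT the CSV library's actual code.
def apply_converters(value, converters)
  converters.each do |converter|
    value = converter.call(value)
    # A non-String result counts as "converted", so later converters are skipped.
    break unless value.is_a?(String)
  end
  value
end

# Hypothetical converters: one that may return an Integer, one that returns a String.
to_integer = lambda { |v| v.match?(/\A\d+\z/) ? v.to_i : v }
shout      = lambda { |v| v.upcase }

p apply_converters("42", [to_integer, shout])   # => 42 (shout never runs)
p apply_converters("abc", [to_integer, shout])  # => "ABC" (to_integer passed it through)
```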

In this way you remove a TON of overhead when parsing CSV data. The larger the file you're processing, the more overhead you eliminate.
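
If every step really does have to run on the same value, one option (my own workaround sketch, not something from the email) is to fold the steps into a single converter, so there is nothing left for the short-circuit to skip. The lambda names below are hypothetical.

```ruby
require 'csv'

# Hypothetical individual steps. The first two return Strings, the last a Symbol.
strip_whitespace = ->(header) { header.strip }
snake_case       = ->(header) { header.downcase.gsub(/\s+/, '_') }
symbolize        = ->(header) { header.to_sym }

# One converter that runs every step itself, in order.
all_steps = ->(header) { symbolize.(snake_case.(strip_whitespace.(header))) }

table = CSV.parse(
  " First Name ,Age\nAda,36\n",
  headers: true,
  header_converters: [all_steps]
)

p table.headers # => [:first_name, :age]
```

Listing the String-returning steps before the non-String one also works, since the chain only stops once a value is no longer a String.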

Here is the email response I got back:

...

The converters are basically a pipeline of conversions to try. Let's say you're using two converters, one for dates and one for numbers. Without a linked line, we would try both for every field. However, we know a couple of things:

  • An unconverted CSV field is a String, because that's how we read it in
  • A field that is now a non-String has been converted, so we can stop searching for a converter that matches.

Given that, the optimization helps our example skip checking the number converter if we already have a Date object.
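
The date/number scenario from the email can be reproduced with two custom field converters. This is a sketch under my own assumptions: date_converter, number_converter, and the number_converter_calls list are illustrative names, and the converters only recognize the simple formats shown.

```ruby
require 'csv'
require 'date'

# Illustrative only: tracks which fields reach the number converter.
number_converter_calls = []

# Returns a Date for YYYY-MM-DD fields, otherwise passes the String through.
date_converter = lambda do |field|
  field.match?(/\A\d{4}-\d{2}-\d{2}\z/) ? Date.strptime(field, '%Y-%m-%d') : field
end

# Returns an Integer for all-digit fields, otherwise passes the String through.
number_converter = lambda do |field|
  number_converter_calls << field
  field.match?(/\A\d+\z/) ? field.to_i : field
end

rows = CSV.parse("2022-10-18,42\n", converters: [date_converter, number_converter])

p rows.first.map(&:class)   # => [Date, Integer]
p number_converter_calls    # => ["42"]
```

The date field never reaches number_converter because it is already a Date by then; only "42", which date_converter returned unchanged, is checked.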

...

jefflunt answered Oct 21 '22 00:10
