Why does processing of header converters stop with the first non-String that's returned from a header converter?
After the built-in :symbol header converter is triggered, no other converters are processed. It seems that processing of header converters stops with the first converter that returns anything that's not a String (i.e. you get the same behavior if you write a custom header converter that returns a Fixnum, or anything else).
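This code works as expected, throwing the exception in :throw_an_exception:

require 'csv'

# Register a custom header converter that always raises.
CSV::HeaderConverters[:throw_an_exception] = lambda do |header|
  raise 'Exception triggered.'
end

csv_str = "Numbers\n" +
          "1\n" +
          "4\n" +
          "7"

# Options are passed as keyword arguments; wrapping them in a literal
# hash raises ArgumentError on Ruby 3 and later.
puts CSV.parse(
  csv_str,
  headers: true,
  header_converters: [
    :throw_an_exception,
    :symbol
  ]
)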
However, if you switch the order of the header converters so that the :symbol converter comes first, the :throw_an_exception lambda is never called.

...
  header_converters: [
    :symbol,
    :throw_an_exception
  ]
...
So I reached out to JEG2.
I was thinking that converters were intended to be a series of steps in a chain, where all elements were supposed to go through every step. In fact, that's not the way to best use the CSV library, especially if you have a very large amount of data.
The way it should be used (and this is the answer to the "why" question and the explanation for why this is better for performance) is to have the converters work like a series of matchers: the first converter that matches returns a non-String, which tells the CSV library that the current value has been converted successfully. The parser can then stop as soon as the value is a non-String and move on to the next header/cell value.
In this way you remove a TON of overhead when parsing CSV data. The larger the file you're processing, the more overhead you eliminate.
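A minimal sketch of that short-circuit, using two hypothetical header converters (upcase and to_symbol are names made up for this illustration, not part of the CSV library) that record each call so you can see which ones actually ran:

```ruby
require 'csv'

calls = [] # records [converter_name, header_value] for each invocation

# Hypothetical converter that returns a String, so the chain continues.
upcase = lambda do |header|
  calls << [:upcase, header]
  header.upcase
end

# Hypothetical converter that returns a Symbol, so the chain stops here.
to_symbol = lambda do |header|
  calls << [:to_symbol, header]
  header.downcase.to_sym
end

csv_str = "Numbers\n1\n4\n7"

# String-returning converter first: both converters see the header.
CSV.parse(csv_str, headers: true, header_converters: [upcase, to_symbol])
string_first = calls.map(&:first)

calls.clear

# Symbol-returning converter first: upcase is never reached.
CSV.parse(csv_str, headers: true, header_converters: [to_symbol, upcase])
symbol_first = calls.map(&:first)

puts "String-returning converter first: #{string_first.inspect}"
puts "Symbol-returning converter first: #{symbol_first.inspect}"
```

If you do want several steps applied to the same header, one option is to keep every String-returning step ahead of the step that changes the type, or to fold all the steps into a single lambda.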
Here is the email response I got back:
...
The converters are basically a pipeline of conversions to try. Let's say you're using two converters, one for dates and one for numbers. Without a linked line, we would try both for every field. However, we know a couple of things:
- An unconverted CSV field is a String, because that's how we read it in
- A field that is now a non-String has been converted, so we can stop searching for a converter that matches.
Given that, the optimization helps our example skip checking the number converter if we already have a Date object.
...
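The date-and-number scenario from the email can be sketched roughly like this. The two lambdas below are simplified stand-ins written for this illustration (they are not the library's built-in :date and numeric converters), and the sample data is made up:

```ruby
require 'csv'
require 'date'

number_checks = [] # every field the number converter was asked to look at

# Illustrative date converter: returns a Date for YYYY-MM-DD strings,
# otherwise hands the String back unchanged.
to_date = lambda do |field|
  field.match?(/\A\d{4}-\d{2}-\d{2}\z/) ? Date.strptime(field, '%Y-%m-%d') : field
end

# Illustrative number converter: returns an Integer for digit-only strings,
# otherwise hands the String back unchanged.
to_number = lambda do |field|
  number_checks << field
  field.match?(/\A-?\d+\z/) ? field.to_i : field
end

csv_str = "date,count\n2024-01-15,42\n"

table = CSV.parse(csv_str, headers: true, converters: [to_date, to_number])
row = table.first

puts row['date'].class
puts row['count'].class
puts "Fields checked by the number converter: #{number_checks.inspect}"
```

The date field never reaches to_number because to_date already turned it into a Date; only the field that is still a String gets passed along.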