 

Performance: Ruby CSV.foreach vs CSV.parse

I'm not sure this question is specific to Ruby; you may find it relevant to other languages as well.

I wonder if I should use parse or foreach:

  • CSV.parse(File.read(filepath)) will parse the entire file and return an array of arrays that mirrors the CSV file and is held entirely in memory. Later, I'll process the rows of that array.

  • CSV.foreach(filepath) will read/parse the file row by row and process it row by row.

When it comes to performance, is there any difference? Is one approach preferable?

PS: I know that in Ruby I can also pass a block to the parse method, and then it will handle each row separately, as sketched below.
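To make that concrete, here's a minimal sketch of the three call styles (input.csv is just a placeholder path):

require 'csv'

path = "input.csv"  # placeholder, for illustration only

# 1) Whole-file parse: returns an array of arrays, all held in memory.
rows = CSV.parse(File.read(path))
rows.each { |row| puts row.inspect }

# 2) Row-by-row: only the current row is materialized.
CSV.foreach(path) do |row|
  puts row.inspect
end

# 3) parse with a block: rows are yielded one at a time instead of
#    being collected into an array (though File.read still loads the
#    whole file into a single string first).
CSV.parse(File.read(path)) { |row| puts row.inspect }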

asked Oct 03 '13 by benams


1 Answer

Here's my test:

require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"
large_csv_file = "test_data_large_20m.csv"

# bmbm runs a rehearsal pass to warm up before the reported timing pass.
Benchmark.bmbm do |x|
  # CSV.parse with a block yields rows one at a time (no full array is built),
  # so both methods stream here; the comparison is parsing speed only.
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file, headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #parse") do
    CSV.parse(File.open(large_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #foreach") do
    CSV.foreach(large_csv_file, headers: true) do |row|
      row
    end
  end
end

Rehearsal -------------------------------------------------------
Small: CSV #parse     0.950000   0.000000   0.950000 (  0.952493)
Small: CSV #foreach   0.950000   0.000000   0.950000 (  0.953514)
Large: CSV #parse   659.000000   2.120000 661.120000 (661.280070)
Large: CSV #foreach 648.240000   1.800000 650.040000 (650.062963)
------------------------------------------- total: 1313.060000sec

                          user     system      total        real
Small: CSV #parse     1.000000   0.000000   1.000000 (  1.143246)
Small: CSV #foreach   0.990000   0.000000   0.990000 (  0.984285)
Large: CSV #parse   646.380000   1.890000 648.270000 (648.286247)
Large: CSV #foreach 651.010000   1.840000 652.850000 (652.874320)

The benchmarks were run on a MacBook Pro with 8 GB of memory. The results indicate that runtime is statistically equivalent whether you use CSV#parse or CSV#foreach. Note that both benchmarks pass a block to parse, so rows are yielded one at a time rather than collected into an array; called without a block, CSV.parse holds the entire result in memory, which is the main practical difference for large files.
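If memory rather than speed is the concern, here's a rough sketch of how you might compare retained memory (it shells out to POSIX ps, so the numbers are approximate and the ~50k-row file name from the benchmark above is reused):

require 'csv'

def rss_kb
  # Resident set size of the current process in KB (POSIX ps).
  `ps -o rss= -p #{Process.pid}`.to_i
end

file = "test_data_small_50k.csv"

before = rss_kb
rows = CSV.parse(File.read(file))  # no block: the whole table stays referenced
puts "parse (no block): ~#{rss_kb - before} KB retained"

rows = nil
GC.start

before = rss_kb
CSV.foreach(file) { |row| row }    # each row becomes garbage once the block returns
puts "foreach:          ~#{rss_kb - before} KB growth"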

Headers option removed (only the small file was tested):

require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"

Benchmark.bmbm do |x|
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file)) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file) do |row|
      row
    end
  end
end

Rehearsal -------------------------------------------------------
Small: CSV #parse     0.590000   0.010000   0.600000 (  0.597775)
Small: CSV #foreach   0.620000   0.000000   0.620000 (  0.621950)
---------------------------------------------- total: 1.220000sec

                          user     system      total        real
Small: CSV #parse     0.590000   0.000000   0.590000 (  0.597594)
Small: CSV #foreach   0.610000   0.000000   0.610000 (  0.604537)

Notes:

  • large_csv_file had a different structure than small_csv_file, so comparing results (e.g. rows/sec) between the two files would be inaccurate.

  • small_csv_file had 50,000 records.

  • large_csv_file had 1,000,000 records.

  • Setting headers: true reduces performance significantly, because each row is wrapped in a CSV::Row that supports hash-like lookup by header (see the Header Converters section of the CSV docs: http://www.ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html).
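To see what headers: true actually buys (and why it costs something), a small sketch using inline data:

require 'csv'

data = "name,age\nAlice,30\nBob,25\n"

# Without headers: each row is a plain array.
CSV.parse(data).each { |row| p row }
# => ["name", "age"], ["Alice", "30"], ["Bob", "25"]

# With headers: true, the first line is consumed as headers and each
# remaining row becomes a CSV::Row with lookup by header name,
# which means extra objects built per row.
CSV.parse(data, headers: true).each do |row|
  p row["name"]  # => "Alice", then "Bob"
end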

answered Nov 20 '22 by Garren S