 

Performance: Ruby CSV.foreach vs CSV.parse

I'm not sure this question is specific to Ruby; you may find it relevant to other languages as well.

I wonder if I should use parse or foreach:

  • CSV.parse(File.read(filepath)) will parse the entire file and return an array of arrays that mirrors the CSV file and is held entirely in memory. Later, I'll process the rows of that array.

  • CSV.foreach(filepath) will read/parse the file row by row and process it row by row.

When it comes to performance, is there any difference? Is one approach preferable?

PS: I know that in Ruby I can also pass a block to the parse method, and then it will handle each row separately, as sketched below.
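To make that concrete, here's a minimal sketch of the three call styles (input.csv is just a placeholder path):

require 'csv'

path = "input.csv"  # placeholder, for illustration only

# 1) Whole-file parse: returns an array of arrays, all held in memory.
rows = CSV.parse(File.read(path))
rows.each { |row| puts row.inspect }

# 2) Row-by-row: only the current row is materialized.
CSV.foreach(path) do |row|
  puts row.inspect
end

# 3) parse with a block: rows are yielded one at a time instead of
#    being collected into an array (though File.read still loads the
#    whole file into a single string first).
CSV.parse(File.read(path)) { |row| puts row.inspect }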

asked Oct 03 '13 by benams


1 Answer

Here's my test:

require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"
large_csv_file = "test_data_large_20m.csv"

# bmbm runs a rehearsal pass to warm up before the reported timing pass.
Benchmark.bmbm do |x|
  # CSV.parse with a block yields rows one at a time (no full array is built),
  # so both methods stream here; the comparison is parsing speed only.
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file, headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #parse") do
    CSV.parse(File.open(large_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #foreach") do
    CSV.foreach(large_csv_file, headers: true) do |row|
      row
    end
  end
end

Rehearsal -------------------------------------------------------
Small: CSV #parse     0.950000   0.000000   0.950000 (  0.952493)
Small: CSV #foreach   0.950000   0.000000   0.950000 (  0.953514)
Large: CSV #parse   659.000000   2.120000 661.120000 (661.280070)
Large: CSV #foreach 648.240000   1.800000 650.040000 (650.062963)
------------------------------------------- total: 1313.060000sec

                          user     system      total        real
Small: CSV #parse     1.000000   0.000000   1.000000 (  1.143246)
Small: CSV #foreach   0.990000   0.000000   0.990000 (  0.984285)
Large: CSV #parse   646.380000   1.890000 648.270000 (648.286247)
Large: CSV #foreach 651.010000   1.840000 652.850000 (652.874320)

The benchmarks were run on a MacBook Pro with 8 GB of memory. The results indicate that runtime is statistically equivalent whether you use CSV#parse or CSV#foreach. Note that both benchmarks pass a block to parse, so rows are yielded one at a time rather than collected into an array; called without a block, CSV.parse holds the entire result in memory, which is the main practical difference for large files.
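If memory rather than speed is the concern, here's a rough sketch of how you might compare retained memory (it shells out to POSIX ps, so the numbers are approximate and the ~50k-row file name from the benchmark above is reused):

require 'csv'

def rss_kb
  # Resident set size of the current process in KB (POSIX ps).
  `ps -o rss= -p #{Process.pid}`.to_i
end

file = "test_data_small_50k.csv"

before = rss_kb
rows = CSV.parse(File.read(file))  # no block: the whole table stays referenced
puts "parse (no block): ~#{rss_kb - before} KB retained"

rows = nil
GC.start

before = rss_kb
CSV.foreach(file) { |row| row }    # each row becomes garbage once the block returns
puts "foreach:          ~#{rss_kb - before} KB growth"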

Headers option removed (only the small file was tested):

require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"

Benchmark.bmbm do |x|
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file)) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file) do |row|
      row
    end
  end
end

Rehearsal -------------------------------------------------------
Small: CSV #parse     0.590000   0.010000   0.600000 (  0.597775)
Small: CSV #foreach   0.620000   0.000000   0.620000 (  0.621950)
---------------------------------------------- total: 1.220000sec

                          user     system      total        real
Small: CSV #parse     0.590000   0.000000   0.590000 (  0.597594)
Small: CSV #foreach   0.610000   0.000000   0.610000 (  0.604537)

Notes:

  • large_csv_file had a different structure than small_csv_file, so comparing results (e.g. rows/sec) between the two files would be inaccurate.

  • small_csv_file had 50,000 records.

  • large_csv_file had 1,000,000 records.

  • Setting headers: true reduces performance significantly, because each row is wrapped in a CSV::Row that supports hash-like lookup by header (see the Header Converters section of the CSV docs: http://www.ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html).
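To see what headers: true actually buys (and why it costs something), a small sketch using inline data:

require 'csv'

data = "name,age\nAlice,30\nBob,25\n"

# Without headers: each row is a plain array.
CSV.parse(data).each { |row| p row }
# => ["name", "age"], ["Alice", "30"], ["Bob", "25"]

# With headers: true, the first line is consumed as headers and each
# remaining row becomes a CSV::Row with lookup by header name,
# which means extra objects built per row.
CSV.parse(data, headers: true).each do |row|
  p row["name"]  # => "Alice", then "Bob"
end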

answered Nov 20 '22 by Garren S