I'm not sure this question is specific to Ruby; it may be relevant to other languages as well.
I'm wondering whether I should use parse or foreach:
CSV.parse(File.read(filepath))
will parse the entire file and return an array of arrays that mirrors the CSV file and is held entirely in memory; I'd then process the rows of that array. (Note that parse takes CSV data, a String or IO, not a file path, hence the File.read.)
CSV.foreach(filepath)
will read and parse the file row by row, so each row can be processed as it is read.
When it comes to performance, is there any difference? Is there a preferable approach?
PS: I know that in Ruby I can pass a block to parse, and it will then handle each row separately.
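For concreteness, here is a minimal sketch of the two approaches ("data.csv" is a placeholder filename):

require 'csv'

# Whole-file approach: the entire parsed result lives in memory at once.
rows = CSV.parse(File.read("data.csv"))   # => Array of Arrays
rows.each { |row| puts row.inspect }

# Streaming approach: rows are yielded one at a time and never accumulated.
CSV.foreach("data.csv") { |row| puts row.inspect }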
Here's my test:
require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"
large_csv_file = "test_data_large_20m.csv"

Benchmark.bmbm do |x|
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file, headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #parse") do
    CSV.parse(File.open(large_csv_file), headers: true) do |row|
      row
    end
  end

  x.report("Large: CSV #foreach") do
    CSV.foreach(large_csv_file, headers: true) do |row|
      row
    end
  end
end
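(Benchmark.bmbm runs each block twice: a rehearsal pass to warm up, then the measured pass, which is why two result tables appear below.)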
Rehearsal -------------------------------------------------------
Small: CSV #parse     0.950000   0.000000   0.950000 (  0.952493)
Small: CSV #foreach   0.950000   0.000000   0.950000 (  0.953514)
Large: CSV #parse   659.000000   2.120000 661.120000 (661.280070)
Large: CSV #foreach 648.240000   1.800000 650.040000 (650.062963)
------------------------------------------- total: 1313.060000sec

                          user     system      total        real
Small: CSV #parse     1.000000   0.000000   1.000000 (  1.143246)
Small: CSV #foreach   0.990000   0.000000   0.990000 (  0.984285)
Large: CSV #parse   646.380000   1.890000 648.270000 (648.286247)
Large: CSV #foreach 651.010000   1.840000 652.850000 (652.874320)
The benchmarks were run on a MacBook Pro with 8 GB of memory. The results indicate that runtime is statistically equivalent whether you use CSV.parse or CSV.foreach. The meaningful difference is memory: when parse is called without a block (as described in the question), it holds the entire parsed result at once, while foreach only ever holds the current row.
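A rough way to observe that memory trade-off (a minimal sketch, assuming a Unix-like system where the resident set size can be read via ps; it reuses the small benchmark file from above):

require 'csv'

# Resident set size of this process, in MB (ps reports KB).
# RSS is a coarse measure and GC timing makes it noisy, but the trend shows.
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
end

before = rss_mb
table = CSV.parse(File.read("test_data_small_50k.csv"), headers: true)
puts "CSV.parse:   +#{(rss_mb - before).round(1)} MB (#{table.size} rows held in memory)"

before = rss_mb
count = 0
CSV.foreach("test_data_small_50k.csv", headers: true) { |row| count += 1 }
puts "CSV.foreach: +#{(rss_mb - before).round(1)} MB (#{count} rows streamed)"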
Headers option removed (only the small file tested):
require 'csv'
require 'benchmark'

small_csv_file = "test_data_small_50k.csv"

Benchmark.bmbm do |x|
  x.report("Small: CSV #parse") do
    CSV.parse(File.open(small_csv_file)) do |row|
      row
    end
  end

  x.report("Small: CSV #foreach") do
    CSV.foreach(small_csv_file) do |row|
      row
    end
  end
end
Rehearsal -------------------------------------------------------
Small: CSV #parse     0.590000   0.010000   0.600000 (  0.597775)
Small: CSV #foreach   0.620000   0.000000   0.620000 (  0.621950)
---------------------------------------------- total: 1.220000sec

                          user     system      total        real
Small: CSV #parse     0.590000   0.000000   0.590000 (  0.597594)
Small: CSV #foreach   0.610000   0.000000   0.610000 (  0.604537)
Notes:
large_csv_file had a different structure than small_csv_file, so comparing results (i.e., rows/sec) between the two files would be inaccurate.
small_csv_file had 50,000 records
large_csv_file had 1,000,000 records
Setting the headers option to true reduces performance significantly, because a header-to-field mapping (a CSV::Row) is built for each row (see the header converters discussion in the CSV docs: http://www.ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html).
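To illustrate what the headers option changes, here is a small self-contained example (the sample data is invented for illustration):

require 'csv'

data = "name,age\nAlice,30\nBob,25\n"

# Without headers: each row is a plain Array, with minimal overhead.
CSV.parse(data) { |row| p row }   # ["name", "age"], ["Alice", "30"], ["Bob", "25"]

# With headers: the first row is consumed as headers, and each subsequent
# row becomes a CSV::Row allowing hash-like access by header name.
CSV.parse(data, headers: true) { |row| p row["name"] }   # "Alice", then "Bob"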