Why is Ruby CSV file reading very slow?




I have a fairly large CSV file, 4 million records with 375 fields each, that needs to be processed. I'm using Ruby's CSV library to read this file and it is very slow. I thought PHP's CSV file processing was slow, but comparing the two reads, PHP is more than 100 times faster. I'm not sure if I'm doing something dumb or if this is just the reality of Ruby not being optimized for this type of batch processing. I set up simple test programs to get comparative times in both Ruby and PHP. All they do is read: no writing, no building of big arrays, and they break out of the CSV read loop after processing 50,000 records. Has anyone else experienced this performance issue?

I'm running locally on a Mac with 4 GB of memory, running OS X 10.6.8 and Ruby 1.8.7.

The Ruby process takes 497 seconds to simply read 50,000 records; the PHP process runs in 4 seconds. That is not a typo: it's more than 100 times faster. FYI, I had code in the loops to print out data values to make sure that each of the processes was actually reading the file and bringing data back.

This is the Ruby Code:

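require 'csv'
x = 0            # number of records read so far
t1 = Time.new    # start time; pathfile is set elsewhere to the CSV file's path
# Read one row at a time and stop after 50,000 records.
CSV.foreach(pathfile) do |row|
  x += 1
  if x > 50000 then break end
end
t2 = Time.new
puts " Time to read the file was #{t2-t1} seconds"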

Here is the PHP code:

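$t1 = microtime(true);  // start time in seconds
$seqno = 0;             // number of records read so far
$fpiData = fopen($pathfile,'r') or die("can not open input file ");
while($inrec = fgetcsv($fpiData,0,',','"')) {
    if (++$seqno > 50000) break;  // stop after 50,000 records
}
fclose($fpiData) or die("can not close input data file");
$t2 = microtime(true);
$t3 = $t2 - $t1;        // elapsed seconds
echo "Start time is $t1 - end time is $t2 - Time to Process was " . $t3 . "\n";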
Eric asked Mar 23 '23 06:03


2 Answers

You'll likely get a massive speed boost by simply updating to a current version of Ruby. In version 1.9, FasterCSV was integrated as Ruby's standard CSV library, replacing the older implementation that ships with 1.8.

Check out chruby to manage your different Ruby versions.
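
To see whether the upgrade pays off on your machine, you could run the same timing under each Ruby version. Here is a minimal sketch, assuming Ruby 1.9 or later; the `time_csv_read` helper and the generated sample file are hypothetical stand-ins for your own script and your 375-field data file:

```ruby
require "csv"
require "benchmark"
require "tempfile"

# Hypothetical helper: read at most `limit` rows from `path`
# and return [rows_read, elapsed_seconds].
def time_csv_read(path, limit)
  rows = 0
  seconds = Benchmark.realtime do
    CSV.foreach(path) do |_row|
      rows += 1
      break if rows >= limit
    end
  end
  [rows, seconds]
end

# Build a small stand-in file (1,000 rows x 375 fields) so the sketch runs on its own.
sample = Tempfile.new(["sample", ".csv"])
1_000.times { |i| sample.puts((1..375).map { |f| "r#{i}f#{f}" }.join(",")) }
sample.close

rows, seconds = time_csv_read(sample.path, 500)
puts "Read #{rows} rows in #{format('%.3f', seconds)} seconds on Ruby #{RUBY_VERSION}"
sample.unlink
```

Point `time_csv_read` at your real file with a limit of 50,000 to compare against the 497 seconds you measured on 1.8.7.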

Momer answered Mar 24 '23 19:03


Check out the smarter_csv gem, which has special options for handling huge files by reading the data in chunks.

It also returns the CSV data as hashes, which can make it easier to insert or update the data in a database.
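
To illustrate the idea of chunked reading with rows as hashes, here is a rough sketch that uses only Ruby's standard CSV library (1.9 or later), not smarter_csv's own API. The `each_csv_chunk` helper and the tiny sample file are hypothetical, made up for this example:

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: yield arrays of row hashes, `chunk_size` rows at a time,
# so only one chunk is held in memory at once.
def each_csv_chunk(path, chunk_size)
  CSV.foreach(path, headers: true).each_slice(chunk_size) do |rows|
    yield rows.map(&:to_h)
  end
end

# Tiny stand-in file with a header line and five records.
sample = Tempfile.new(["people", ".csv"])
sample.puts "id,name"
5.times { |i| sample.puts "#{i},user#{i}" }
sample.close

chunk_sizes = []
first_row = nil
each_csv_chunk(sample.path, 2) do |chunk|
  first_row ||= chunk.first   # each row is a Hash keyed by the header names
  chunk_sizes << chunk.size   # a real job might bulk-insert the chunk here
end
sample.unlink

puts "Chunk sizes: #{chunk_sizes.inspect}"
puts "First row: #{first_row.inspect}"
```

Each chunk is an array of hashes, which is the shape that makes batched database inserts or updates convenient.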

Tilo answered Mar 24 '23 20:03