Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reasons for this disparity in execution speed?

Tags:

python

ruby

I wrote a quick Python script to compare two files, each containing unordered hashes, in order to verify that both files are identical aside from order. Then I rewrote it in Ruby for educational purposes.

The Python implementation takes seconds, while the Ruby implementation takes around 4 minutes.

I have a feeling this is most likely due to my lack of Ruby knowledge, any ideas on what I am doing wrong?

Environment is Windows XP x64, Python 2.6, Ruby 1.8.6

Python

f = open('c:\\file1.txt', 'r')

hashes = dict()

for line in f.readlines():
    if not line in hashes:
        hashes[line] = 1
    else:
        hashes[line] += 1


print "Done file 1"

f.close()

f = open('c:\\file2.txt', 'r')

for line in f.readlines():
    if not line in hashes:
        print "Hash not found!"
    else:
        hashes[line] -= 1

f.close()

print "Done file 2"

num_errors = 0

for key in hashes.keys():
    if hashes[key] != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Ruby

file = File.open("c:\\file1.txt", "r")
hashes = {}

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] += 1
  else
    hashes[line] = 1
  end
}

file.close()

puts "Done file 1"

file = File.open("c:\\file2.txt", "r")

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] -= 1
  else
    puts "Hash not found!"
  end
}

file.close()

puts "Done file 2"

num_errors = 0
hashes.each_key{ |key|
  if hashes[key] != 0
    num_errors += 1
  end
}

puts "Total of #{num_errors} mismatches found"

EDIT To give an idea of scale, each file is pretty big, over 900 000 hashes.

PROGRESS

Using a nathanvda's suggestions, here is the optimized ruby script:

f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"

hashes = Hash.new(0)

File.open(f1, "r") do |f|
  while line = f.gets
    hashes[line] += 1
  end
end  

not_founds = 0

File.open(f2, "r") do |f|
  while line = f.gets
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

num_errors = hashes.values.to_a.select { |z| z != 0}.size   

puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"

On windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223.

On a linux VM! running ruby 1.9.1, the original version ran in 81 seconds, about 1/3 the time as windows 1.8.7. Interestingly, the optimized version took longer at 89 seconds. Note that while line = ... was necessary due to memory constraints.

On windows with Ruby 1.9.1, the original took 457 seconds and the optimized version took 543.

On windows with jRuby, the original took 45 seconds and the optimized version took 43.

I am somewhat surprised by the results, I was expecting 1.9.1 to be an improvement over 1.8.7.

like image 670
tscho Avatar asked Dec 07 '09 23:12

tscho


1 Answers

I've found Ruby's reference implementation (well, Ruby) to be (unscientifically stated) dog slow.

If you have the opportunity, please try running your program under JRuby! Charles Nutter and other Sun folks claim to have sped Ruby up dramatically.

I for one would be most interested in your results.

like image 64
Carl Smotricz Avatar answered Sep 27 '22 16:09

Carl Smotricz