I wrote a quick Python script to compare two files, each containing unordered hashes, in order to verify that both files are identical aside from order. Then I rewrote it in Ruby for educational purposes.
The Python implementation takes seconds, while the Ruby implementation takes around 4 minutes.
I have a feeling this is most likely due to my lack of Ruby knowledge, any ideas on what I am doing wrong?
Environment is Windows XP x64, Python 2.6, Ruby 1.8.6
Python
f = open('c:\\file1.txt', 'r')
hashes = dict()
for line in f.readlines():
if not line in hashes:
hashes[line] = 1
else:
hashes[line] += 1
print "Done file 1"
f.close()
f = open('c:\\file2.txt', 'r')
for line in f.readlines():
if not line in hashes:
print "Hash not found!"
else:
hashes[line] -= 1
f.close()
print "Done file 2"
num_errors = 0
for key in hashes.keys():
if hashes[key] != 0:
print "Uneven hash count: %s" % key
num_errors += 1
print "Total of %d mismatches found" % num_errors
Ruby
file = File.open("c:\\file1.txt", "r")
hashes = {}
file.each_line { |line|
if hashes.has_key?(line)
hashes[line] += 1
else
hashes[line] = 1
end
}
file.close()
puts "Done file 1"
file = File.open("c:\\file2.txt", "r")
file.each_line { |line|
if hashes.has_key?(line)
hashes[line] -= 1
else
puts "Hash not found!"
end
}
file.close()
puts "Done file 2"
num_errors = 0
hashes.each_key{ |key|
if hashes[key] != 0
num_errors += 1
end
}
puts "Total of #{num_errors} mismatches found"
EDIT To give an idea of scale, each file is pretty big, over 900 000 hashes.
PROGRESS
Using a nathanvda's suggestions, here is the optimized ruby script:
f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"
hashes = Hash.new(0)
File.open(f1, "r") do |f|
while line = f.gets
hashes[line] += 1
end
end
not_founds = 0
File.open(f2, "r") do |f|
while line = f.gets
if hashes.has_key?(line)
hashes[line] -= 1
else
not_founds += 1
end
end
end
num_errors = hashes.values.to_a.select { |z| z != 0}.size
puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"
On windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223.
On a linux VM! running ruby 1.9.1, the original version ran in 81 seconds, about 1/3 the time as windows 1.8.7. Interestingly, the optimized version took longer at 89 seconds. Note that while line = ... was necessary due to memory constraints.
On windows with Ruby 1.9.1, the original took 457 seconds and the optimized version took 543.
On windows with jRuby, the original took 45 seconds and the optimized version took 43.
I am somewhat surprised by the results, I was expecting 1.9.1 to be an improvement over 1.8.7.
I've found Ruby's reference implementation (well, Ruby) to be (unscientifically stated) dog slow.
If you have the opportunity, please try running your program under JRuby! Charles Nutter and other Sun folks claim to have sped Ruby up dramatically.
I for one would be most interested in your results.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With