I am processing a bunch of data and I haven't coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:
SELECT body, COUNT(body) AS dup_count
FROM comments
GROUP BY body
HAVING (COUNT(body) > 1)
And get back a list of duplicates. Looking into this I find that these duplicates have multiple hashes. The shortest string of a comment is "[deleted]"
. So let's use that as an example. In my database there are nine instances of a comment being "[deleted]"
and in my database this produces a hash of both 1169143752200809218 and 1738115474508091027. The 116 is found 6 times and 173 is found 3 times. But, when I run it in IRB, I get the following:
a = '[deleted]'.hash # => 811866697208321010
Here is the code I'm using to produce the hash:
def comment_and_hash(chunk)
comment = chunk.at_xpath('*/span[@class="comment"]').text ##Get Comment##
hash = comment.hash
return comment,hash
end
I've confirmed that I don't touch comment anywhere else in my code. Here is my datamapper class.
class Comment
include DataMapper::Resource
property :uid , Serial
property :author , String
property :date , Date
property :body , Text
property :arank , Float
property :srank , Float
property :parent , Integer #Should Be UID of another comment or blank if parent
property :value , Integer #Hash to prevent duplicates from occurring
end
Am I correct in assuming that .hash
on a string will return the same value each time it is called on the same string?
Which value is the correct value assuming my string consists of "[deleted]"
?
Is there a way I could have different strings inside ruby, but SQL would see them as the same string? That seems to be the most plausible explanation for why this is occurring, but I'm really shooting in the dark.
If you run
ruby -e "puts '[deleted]'.hash"
several times, you will notice that the value is different. In fact, the hash value stays only constant as long as your Ruby process is alive. The reason for this is that String#hash
is seeded with a random value. rb_str_hash
(the C implementing function) uses rb_hash_start which uses this random seed which gets initialized every time Ruby is spawned.
You could use a CRC such as Zlib#crc32 for your purposes or you may want to use one of the message digests of OpenSSL::Digest
, although the latter is overkill since for detection of duplicates you probably won't need the security properties.
I use the following to create String#hash alternatives that are consistant across time and processes
require 'zlib'
def generate_id(label)
Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end
Ruby intentionally makes String.hash
produce different values in different sessions: Why is Ruby String.hash inconsistent across machines?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With