
Am I misunderstanding String#hash in Ruby?

I am processing a bunch of data and I haven't coded a duplicate checker into the data processor yet, so I expected duplicates to occur. I ran the following SQL query:

SELECT     body, COUNT(body) AS dup_count 
FROM         comments
GROUP BY body
HAVING     (COUNT(body) > 1) 

This returns a list of duplicates. Looking into it, I find that these duplicates have multiple hashes. The shortest comment string is "[deleted]", so let's use that as an example. My database contains nine instances of the comment "[deleted]", stored with two different hash values: 1169143752200809218 (6 times) and 1738115474508091027 (3 times). But when I run it in IRB, I get the following:

a = '[deleted]'.hash # => 811866697208321010

Here is the code I'm using to produce the hash:

def comment_and_hash(chunk)
  comment = chunk.at_xpath('*/span[@class="comment"]').text # Get the comment text
  hash = comment.hash
  return comment, hash
end

I've confirmed that I don't touch comment anywhere else in my code. Here is my datamapper class.

class Comment

    include DataMapper::Resource

    property :uid       , Serial
    property :author    , String
    property :date      , Date
    property :body      , Text
    property :arank     , Float 
    property :srank     , Float 
    property :parent    , Integer #Should Be UID of another comment or blank if parent
    property :value     , Integer #Hash to prevent duplicates from occurring

end

Am I correct in assuming that .hash on a string will return the same value each time it is called on the same string?

Which value is the correct value assuming my string consists of "[deleted]"?

Is there a way I could have different strings inside Ruby that SQL would see as the same string? That seems to be the most plausible explanation for why this is occurring, but I'm really shooting in the dark.

Noah Clark asked Oct 12 '11 02:10


3 Answers

If you run

ruby -e "puts '[deleted]'.hash"

several times, you will notice that the value differs from run to run. The hash value stays constant only for the lifetime of a single Ruby process, because String#hash is seeded with a random value. rb_str_hash (the C function that implements it) uses rb_hash_start, which uses this random seed, and the seed is initialized each time a Ruby process starts.

You could use a CRC such as Zlib.crc32 for your purposes, or one of the message digests in OpenSSL::Digest. The latter is overkill, though, since detecting duplicates probably does not require the security properties of a cryptographic digest.
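As a minimal sketch of that suggestion, the snippet below computes a process-independent value for a comment body in two ways: with Zlib.crc32, and with a SHA-256 digest. The digest comes from Ruby's standard digest library, swapped in here for OpenSSL::Digest to keep the example free of extra dependencies. The helper names stable_crc and stable_digest are hypothetical and do not appear in the original code.

```ruby
require 'zlib'
require 'digest'

# Hypothetical helper: a 32-bit checksum that is the same in every Ruby process.
def stable_crc(text)
  Zlib.crc32(text)
end

# Hypothetical helper: a hex SHA-256 digest. Collisions are far less likely
# than with a CRC, but the value is a 64-character string, not an integer.
def stable_digest(text)
  Digest::SHA256.hexdigest(text)
end

comment = '[deleted]'
puts stable_crc(comment)
puts stable_digest(comment)
```

Running this script repeatedly should print the same two values each time, unlike String#hash. Note that a hex digest would need a string column rather than the Integer property used in the question.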

emboss answered Sep 23 '22 01:09


I use the following to create a String#hash alternative that is consistent across time and processes:

require 'zlib'

def generate_id(label)
  Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end
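
Because a 30-bit value can be the same for two different strings, a duplicate check built on generate_id would normally confirm a match by comparing the strings themselves. A minimal sketch, assuming comment bodies are held in memory as plain strings (the duplicate? helper and the seen index are hypothetical, not part of the original code):

```ruby
require 'zlib'

def generate_id(label)
  Zlib.crc32(label.to_s) % (2 ** 30 - 1)
end

# Hypothetical helper: returns true if body was recorded before;
# otherwise records it under its id and returns false.
def duplicate?(seen, body)
  bucket = seen[generate_id(body)]
  return true if bucket.include?(body) # compare the text, not only the id
  bucket << body
  false
end

# Hypothetical in-memory index: id => bodies already stored under that id.
seen = Hash.new { |h, k| h[k] = [] }

p duplicate?(seen, '[deleted]') # => false (first occurrence)
p duplicate?(seen, '[deleted]') # => true  (same text seen again)
p duplicate?(seen, 'hello')     # => false
```

With a database, the same idea would be to look up rows by the stored id and then compare the body column before treating a comment as a duplicate.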

Steve Wilhelm answered Sep 21 '22 01:09


Ruby intentionally makes String#hash produce different values in different sessions. See the related question "Why is Ruby String.hash inconsistent across machines?"

Ned Batchelder answered Sep 21 '22 01:09