I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the content of a page. What kind of hash functions are able to hash an entire web page (at least two pages long), so that two copies have the same hash output value? What is the size of a typical hash output value?
Are such hash functions able to put two similar web pages, with slight typos etc., in the same bucket?
Thanks,
Hashing is simply passing some data through a formula that produces a result, called a hash. That hash is usually a string of characters, and the hashes generated by a given formula are always the same length, regardless of how much data you feed into it. For example, MD5 always produces a 128-bit hash, usually written as 32 hexadecimal characters.
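To make the fixed-length point concrete, here is a minimal sketch using Python's standard `hashlib` module. The helper name `md5_hex` and the sample inputs are made up for illustration; MD5 is used only because it is the example mentioned above.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()

# Two made-up inputs of very different sizes.
short_page = b"hello"
long_page = b"<html>" + b"lorem ipsum " * 100_000 + b"</html>"

for page in (short_page, long_page):
    digest = md5_hex(page)
    # The digest length does not depend on the input length.
    print(len(page), "bytes ->", len(digest), "hex characters")
```

Both inputs, a few bytes and over a megabyte, give a digest of the same length.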
A hashing algorithm is a mathematical function that converts input data of arbitrary type and length into an output bit string of fixed length. Hashing algorithms take any input and convert it to an output of uniform size.
Hashing is the process of converting an input of any length into a fixed size string or a number using an algorithm. In hashing, the idea is to use a hash function that converts a given key to a smaller number and uses the small number as an index in a table called a hash table.
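As a rough illustration of the hash table idea, the sketch below reduces a key of any length to a small number and uses it as a bucket index. The table size, the `bucket_index` helper, and the example URLs are all hypothetical; `zlib.crc32` is just one convenient, deterministic choice of hash here, not a recommendation.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical table size, chosen only for the example

def bucket_index(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key of any length to a slot in the range [0, num_buckets)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# A tiny hash table: a list of buckets, each bucket a list of keys.
table = [[] for _ in range(NUM_BUCKETS)]
for url in ("http://example.com/a", "http://example.com/b", "http://example.com/a"):
    table[bucket_index(url)].append(url)

# Identical keys always land in the same bucket.
print(table[bucket_index("http://example.com/a")])
```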
By the NIST standard, the maximum message size that can be hashed with SHA-256 is 2^64 - 1 bits (approximately 2.305 exabytes, which is close to the lower range of the estimates for the NSA's data center in Utah, so you don't need to worry). NIST also allows hashing a message of size zero.
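The arithmetic behind that figure can be checked in a few lines. This is only a sketch; the variable names are mine, and an exabyte is taken as 10^18 bytes.

```python
import hashlib

max_bits = 2**64 - 1       # largest message length SHA-256 allows, in bits
max_bytes = max_bits // 8  # whole bytes that fit under that limit
print(max_bytes / 10**18)  # size in exabytes (10**18 bytes)

# A zero-length message is valid input too and still gives a full-size digest.
empty_digest = hashlib.sha256(b"").hexdigest()
print(len(empty_digest), "hex characters")
```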
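Any hash function, given two inputs x and y such that x = y, will by definition return the same value for them. But if you want to do this kind of duplicate detection properly, you will need either a cryptographic hash function, which catches exact duplicates, or a scheme built for near-duplicate detection, such as locality-sensitive hashing.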
Which one to use really depends on your needs; crypto hashes are useless in near-duplicate detection, since they're designed to map near-duplicates to very different values.
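A small sketch of both points, assuming SHA-256 as the cryptographic hash: identical pages share a digest and can be grouped by it, while a copy with a single typo gets a completely different digest. The page names, texts, and helper functions are invented for the example.

```python
import hashlib

def fingerprint(page: str) -> str:
    """SHA-256 digest of the page text, as a hex string."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()

def find_exact_duplicates(pages: dict[str, str]) -> dict[str, list[str]]:
    """Group page names by digest and keep only groups with more than one page."""
    groups: dict[str, list[str]] = {}
    for name, text in pages.items():
        groups.setdefault(fingerprint(text), []).append(name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}

pages = {
    "a.html": "The quick brown fox jumps over the lazy dog.",
    "b.html": "The quick brown fox jumps over the lazy dog.",  # exact copy of a.html
    "c.html": "The quick brown fox jumps over the lazy dgo.",  # copy with a typo
}

duplicates = find_exact_duplicates(pages)
print(list(duplicates.values()))   # groups of exact duplicates
print(fingerprint(pages["a.html"]))
print(fingerprint(pages["c.html"]))  # unrelated-looking digest despite one typo
```

Only `a.html` and `b.html` end up grouped together; `c.html` is missed, which is exactly why a crypto hash does not help with near-duplicates.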
I think you’re looking for fuzzy hashing, where only portions of the document are hashed instead of the whole document at once.
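One simple way to hash portions of a document, sketched below, is to hash every run of k consecutive words (a "shingle") and compare the resulting sets of hashes. This is an illustration of the general idea under my own assumptions (word shingles, k = 3, Jaccard overlap as the score), not the algorithm of any particular fuzzy-hashing tool.

```python
import hashlib

def shingle_hashes(text: str, k: int = 3) -> set[int]:
    """Hash every run of k consecutive words and return the set of hashes."""
    words = text.lower().split()
    if len(words) < k:
        shingles = [" ".join(words)]  # text too short: treat it as one shingle
    else:
        shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Keep the first 8 bytes of each SHA-1 digest as a 64-bit integer.
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }

def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard overlap of the two shingle-hash sets, between 0.0 and 1.0."""
    ha, hb = shingle_hashes(a, k), shingle_hashes(b, k)
    return len(ha & hb) / len(ha | hb)

original = "the quick brown fox jumps over the lazy dog near the river bank today"
with_typo = "the quick brown fox jumps over the lzay dog near the river bank today"
unrelated = "completely different content about cooking pasta with tomato sauce"

print(similarity(original, with_typo))  # high: most portions still match
print(similarity(original, unrelated))  # low: no portions in common
```

A typo only changes the few shingles that contain it, so the two versions still share most of their portion hashes, whereas a whole-document hash would change completely.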