 

How does hashing the entire content of a web page work?

I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the content of a page. What kind of hash functions are able to hash an entire web page (which runs to at least two pages), so that two copies have the same hash output value? What is the size of a typical hash output value?

Are such hash functions able to put two similar web pages (with slight typos, etc.) in the same bucket?

Thanks,

asked Apr 30 '11 by xyz

People also ask

What does it mean to hash a website?

Hashing is simply passing some data through a formula that produces a result, called a hash. That hash is usually a string of characters, and the hashes generated by a given formula are always the same length, regardless of how much data you feed into it. For example, the MD5 formula always produces 32-character (hexadecimal) hashes.
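For instance, a quick demonstration with Python's standard hashlib module (the page bodies are made-up stand-ins):

```python
import hashlib

# Two byte-identical page bodies (hypothetical content).
page_a = b"<html><body>Hello, world!</body></html>"
page_b = b"<html><body>Hello, world!</body></html>"

digest_a = hashlib.md5(page_a).hexdigest()
digest_b = hashlib.md5(page_b).hexdigest()

print(len(digest_a))         # 32 hex characters, however large the input
print(digest_a == digest_b)  # True: identical content, identical hash
```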

How does hashing work What does it do?

A hashing algorithm is a mathematical algorithm that converts an input data array of a certain type and arbitrary length to an output bit string of a fixed length. Hashing algorithms take any input and convert it to a fixed-length output, so the result has a uniform size no matter what goes in.

How does hashing in data structure works?

Hashing is the process of converting an input of any length into a fixed-size string or number using an algorithm. In hashing, the idea is to use a hash function that converts a given key to a smaller number and to use that small number as an index in a table called a hash table.
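As a rough sketch of that idea (not how any particular language implements its tables), the large hash value is reduced modulo the table size to get a bucket index:

```python
# Minimal separate-chaining hash table; the keys here are hypothetical URLs.
TABLE_SIZE = 16
table = [[] for _ in range(TABLE_SIZE)]

def put(key, value):
    index = hash(key) % TABLE_SIZE       # big hash value -> small table index
    table[index].append((key, value))

def get(key):
    index = hash(key) % TABLE_SIZE
    for k, v in table[index]:            # scan the bucket for the exact key
        if k == key:
            return v
    return None

put("example.com/page1", "cached body")
print(get("example.com/page1"))          # cached body
```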

How much data can be hashed?

Therefore, by the NIST standard, the maximum message size that can be hashed with SHA-256 is 2^64−1 bits (approximately 2.305 exabytes, which is close to the lower range of the estimated capacity of the NSA's data center in Utah, so you don't need to worry). NIST also allows hashing the zero-length message.
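The arithmetic behind that figure checks out:

```python
# SHA-256's maximum message length is 2**64 - 1 bits (FIPS 180-4).
max_bits = 2**64 - 1
max_exabytes = max_bits / 8 / 10**18  # bits -> bytes -> exabytes
print(round(max_exabytes, 3))         # 2.306 (the ~2.305 above truncates)
```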


2 Answers

Any hash function, given two inputs x and y s.t. x = y, will by definition return the same value for them. But if you want to do this kind of duplicate detection properly, you will need either:

  • a cryptographically strong hash function such as MD5, SHA-1 or SHA-512 (note that MD5 and SHA-1 are no longer collision-resistant against a deliberate attacker, though for accidental duplicates any of them still works), which will practically never map two different pages to the same value, so you can assume an equal hash value means equal input, or
  • a locality sensitive hash function if you want to detect near-duplicates.

Which one to use really depends on your needs; crypto hashes are useless in near-duplicate detection, since they're designed to map near-duplicates to very different values.
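A minimal sketch of the first option, assuming pages arrive as decoded HTML strings (real crawlers usually normalize markup and whitespace first, so trivially different bytes don't defeat the comparison); SHA-256 stands in here for the hashes named above:

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Hash the entire page body; equal content always yields equal digests."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

seen = {}  # digest -> first URL observed with that exact content

def is_duplicate(url: str, html: str) -> bool:
    digest = page_fingerprint(html)
    if digest in seen:
        return True   # byte-identical to a page we already crawled
    seen[digest] = url
    return False
```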

answered Nov 15 '22 by Fred Foo


I think you're looking for fuzzy hashing, where only portions of the document are hashed instead of the whole document at once.
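Fuzzy/piecewise hashing usually means tools like ssdeep; a closely related technique that is easy to sketch is SimHash, the kind of locality-sensitive hash the first answer mentions. This is a toy version, not a production implementation:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Near-identical texts get fingerprints with a small Hamming distance."""
    votes = [0] * bits
    for token in text.split():
        # 64-bit hash per token (first 8 bytes of its MD5 digest)
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each output bit is the majority vote across all tokens.
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")  # slight typo
print(hamming(a, b))  # small distance, so likely near-duplicates
```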

answered Nov 15 '22 by Gumbo