How does hashing of entire content of a web page work?

I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the content of a page. What kind of hash functions can hash an entire web page (typically at least two printed pages long), so that two copies have the same hash output value? What is the size of a typical hash output value?

Are such hash functions able to put two similar web pages, differing only by slight typos etc., in the same bucket?

Thanks,

asked Apr 30 '11 10:04

xyz

2 Answers

Any hash function, given two inputs x and y such that x = y, will by definition return the same value for them. But if you want to do this kind of duplicate detection properly, you will need either:

a cryptographically strong hash function such as SHA-256 or SHA-512, which will practically never map two different pages to the same value, so you can assume an equal hash value means equal input (MD5 and SHA-1 also avoid accidental collisions, but deliberate collisions can be constructed for both), or
a locality-sensitive hash function if you want to detect near-duplicates.

Which one to use really depends on your needs; crypto hashes are useless in near-duplicate detection, since they're designed to map near-duplicates to very different values.
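As a minimal sketch of the exact-duplicate case, the snippet below hashes whole pages with SHA-256 from Python's standard hashlib module. The helper name page_digest and the sample pages are made up for illustration. It also shows that the output size is fixed no matter how long the page is: 256 bits for SHA-256 and 512 bits for SHA-512.

```python
import hashlib


def page_digest(html: str) -> str:
    """Return the SHA-256 digest of a page as 64 hex characters (256 bits)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


# Hypothetical sample pages, used only for this illustration.
original = "<html><body>Hashing a whole web page</body></html>"
exact_copy = "<html><body>Hashing a whole web page</body></html>"
with_typo = "<html><body>Hashing a whole web pgae</body></html>"

# Identical input always gives an identical digest.
print(page_digest(original) == page_digest(exact_copy))  # True

# A single transposed letter gives a different, unrelated digest,
# which is why this approach cannot find near-duplicates.
print(page_digest(original) == page_digest(with_typo))  # False

# The digest size depends only on the hash function, not on the page length.
print(hashlib.sha256().digest_size * 8)  # 256
print(hashlib.sha512().digest_size * 8)  # 512
```

A crawler would typically store these digests in a set or a database index and skip any page whose digest has been seen before.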

answered Nov 15 '22 06:11

Fred Foo

I think you’re looking for fuzzy hashing, where portions of the document are hashed separately instead of hashing the whole document at once.
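One rough way to picture hashing portions of a document: hash every run of k consecutive words (a "shingle") and compare the resulting sets of hashes. This is a simplified sketch and not the algorithm of any particular fuzzy-hashing tool; the names shingle_hashes and similarity, the shingle length k = 3, and the sample texts are all assumptions made for the example.

```python
import hashlib


def shingle_hashes(text: str, k: int = 3) -> set[int]:
    """Hash every run of k consecutive words (a shingle) to a 64-bit integer."""
    words = text.lower().split()
    # Texts shorter than k words still produce one (short) shingle.
    count = max(1, len(words) - k + 1)
    shingles = [" ".join(words[i:i + k]) for i in range(count)]
    return {
        int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle-hash sets, between 0.0 and 1.0."""
    ha, hb = shingle_hashes(a, k), shingle_hashes(b, k)
    return len(ha & hb) / len(ha | hb)


# Hypothetical sample texts, used only for this illustration.
page = "we can detect duplicate pages by hashing the content of a page"
typo = "we can detect duplicate pages by hashing the contnet of a page"
other = "sliding window minimum algorithm for a list of numbers"

print(similarity(page, page))   # identical texts share every shingle
print(similarity(page, typo))   # one typo only changes the shingles that contain it
print(similarity(page, other))  # unrelated texts share no shingles
```

Because a typo only disturbs the few shingles that contain it, the two near-duplicates keep most of their hashes in common, and a threshold on the similarity score can be used to group them. Comparing full sets does not scale to a large crawl, which is where locality-sensitive schemes such as MinHash or SimHash come in.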

answered Nov 15 '22 06:11

Gumbo

Tags: algorithm, indexing, data-structures, hash, search-engine
