
Is there a hashing algorithm that is tolerant of minor differences?

I'm doing some web crawling type stuff where I'm looking for certain terms in webpages and finding their location on the page, and then caching it for later use. I'd like to be able to check the page periodically for any major changes. Something like md5 can be foiled by simply putting the current date and time on the page.

Are there any hashing algorithms that work for something like this?

asked Apr 13 '11 by Jason Baker


2 Answers

A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content-defined chunking for a way to split up the document.
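For illustration, here is a minimal shingling sketch in Python (the file names, shingle width, and 0.9 threshold are assumptions for the example, not anything from the answer): hash overlapping word shingles and compare the sets with the Jaccard index, so an updated date or timestamp only disturbs the few shingles that overlap it.

```python
import hashlib

def shingles(text, w=4):
    """Return the set of hashed w-word shingles for a piece of text."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + w]).encode("utf-8")).hexdigest()
        for i in range(max(1, len(words) - w + 1))
    }

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 means identical)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical cached copy vs. freshly crawled copy of the same page.
old = shingles(open("page_cached.html", encoding="utf-8").read())
new = shingles(open("page_fresh.html", encoding="utf-8").read())
if jaccard(old, new) < 0.9:   # threshold is arbitrary; tune it for your pages
    print("page changed significantly")
```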

I read a paper a few years back about using Bloom filters for similarity detection: "Using Bloom Filters to Refine Web Search Results". It's an interesting idea, but I never got around to experimenting with it.
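A rough sketch of the general idea (not the paper's actual algorithm; the filter size and hash count below are arbitrary): build a small Bloom filter per page from its shingles and use the bit overlap between two filters as a cheap similarity estimate, so you only ever store a fixed-size summary per page.

```python
import hashlib

FILTER_BITS = 1024   # assumed filter size
NUM_HASHES = 4       # assumed number of hash functions

def bloom_bits(shingles):
    """Bloom filter for a set of shingles, represented (for simplicity)
    as the set of bit positions that would be switched on."""
    bits = set()
    for s in shingles:
        digest = hashlib.sha1(s.encode("utf-8")).digest()
        for k in range(NUM_HASHES):
            # carve NUM_HASHES independent positions out of slices of the digest
            bits.add(int.from_bytes(digest[4 * k:4 * k + 4], "big") % FILTER_BITS)
    return bits

def bloom_similarity(a, b):
    """Fraction of set bits the two filters share; a crude similarity proxy."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0
```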

answered Nov 16 '22 by Jim Mischel


This might be a good place to use the Levenshtein distance metric, which quantifies the amount of editing required to transform one sequence into another.
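For reference, here is a small dynamic-programming implementation of the metric (in practice an existing library such as python-Levenshtein would likely be the easier route):

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[len(b)]

assert levenshtein("kitten", "sitting") == 3
```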

The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.

You might also try some sort of hybrid approach: let a hashing algorithm tell you that any change has been made, and use that as a trigger to retrieve an archival copy of the document for a more rigorous (Levenshtein) comparison.
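Sketching what that hybrid might look like, with a couple of assumptions of my own (MD5 as the cheap change trigger, and difflib's similarity ratio standing in for a Levenshtein-style comparison against the archival copy):

```python
import difflib
import hashlib

def page_changed_significantly(old_text, new_text, threshold=0.95):
    """Cheap hash check first; only run the expensive comparison when the hashes differ."""
    old_hash = hashlib.md5(old_text.encode("utf-8")).hexdigest()
    new_hash = hashlib.md5(new_text.encode("utf-8")).hexdigest()
    if old_hash == new_hash:
        return False   # byte-identical, nothing to do
    # Something changed; measure how much of the text survived.
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return ratio < threshold   # small edits (e.g. a timestamp) stay above the threshold
```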

answered Nov 16 '22 by Drew Hall