I'm doing some web crawling type stuff where I'm looking for certain terms in webpages, finding their location on the page, and then caching it for later use. I'd like to be able to check the page periodically for any major changes. Something like MD5 can be foiled by simply putting the current date and time on the page.
Are there any hashing algorithms that work for something like this?
A note on ordinary cryptographic hashes first. The most common are Message Digest 5 (MD5) and the Secure Hash Algorithm (SHA) family, SHA-1 and SHA-2. They are designed to be one-way (you can't reconstruct the input from the digest) and to exhibit the avalanche effect: the slightest change in the input produces a dramatically different hash value. SHA-2 is still considered secure against brute-force attacks, while MD5 and SHA-1 have known weaknesses and are considered unfit for security use. For your problem, though, the avalanche effect is exactly the obstacle: a cryptographic hash can only tell you whether the page is byte-for-byte identical, never whether a change was major or minor.
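To make that concrete, here is a minimal demonstration in Python using the standard hashlib module: two pages that differ by one character in an embedded timestamp produce completely unrelated digests.

```python
import hashlib

# Two pages that differ only in an embedded timestamp.
page_a = b"<html><body>Hello world. Generated at 2024-01-01 00:00</body></html>"
page_b = b"<html><body>Hello world. Generated at 2024-01-01 00:01</body></html>"

# A one-character change flips roughly half the bits of the digest,
# so a cryptographic hash can only answer "identical or not".
print(hashlib.md5(page_a).hexdigest())
print(hashlib.md5(page_b).hexdigest())
print(hashlib.sha256(page_a).hexdigest())
print(hashlib.sha256(page_b).hexdigest())
```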
A common way to measure document similarity is shingling, which breaks the text into overlapping k-grams and compares the resulting sets; it's somewhat more involved than plain hashing. Also look into content-defined chunking for a way to split the document at content-based boundaries.
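As a rough sketch of the idea (the function names and the choice of word-level 5-shingles are illustrative assumptions, not a standard), shingling plus Jaccard similarity might look like this; a production system would typically hash the shingles and use something like MinHash so it doesn't have to store the full sets:

```python
def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A & B| / |A | B|, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

old_page = "the quick brown fox jumps over the lazy dog near the river bank"
new_page = "the quick brown fox jumps over the lazy dog near the river today"

sim = jaccard(shingles(old_page), shingles(new_page))
print(f"similarity: {sim:.2f}")  # close to 1.0 -> only a minor change
```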
I read a paper a few years back about using Bloom filters for similarity detection: "Using Bloom Filters to Refine Web Search Results." It's an interesting idea, but I never got around to experimenting with it.
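I can't vouch for the paper's exact construction, but the general shape of the idea is easy to sketch: build a small Bloom filter per page and compare the bit overlap as a cheap similarity estimate. Everything below (the filter size, hash count, and the bit_overlap heuristic) is an illustrative assumption, not the paper's method:

```python
import hashlib

def bloom(items, m: int = 1024, k: int = 3) -> int:
    """Build an m-bit Bloom filter with k hash functions, stored as an int bitmask."""
    bits = 0
    for item in items:
        for seed in range(k):
            h = hashlib.md5(f"{seed}:{item}".encode()).digest()
            bits |= 1 << (int.from_bytes(h, "big") % m)
    return bits

def bit_overlap(a: int, b: int) -> float:
    """Fraction of set bits shared between two filters (a rough similarity score)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0

old_bits = bloom("the quick brown fox jumps over the lazy dog".split())
new_bits = bloom("the quick brown fox jumps over the lazy cat".split())
print(f"overlap: {bit_overlap(old_bits, new_bits):.2f}")
```

The appeal is that each page reduces to a fixed-size bitmask, so comparisons stay cheap no matter how large the pages are.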
This might be a good place to use the Levenshtein distance metric, which quantifies the minimum number of single-character edits (insertions, deletions, and substitutions) required to transform one sequence into another.
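For reference, here is the classic dynamic-programming implementation (a straightforward textbook version, not tuned for long documents):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```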
The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.
You also might try some sort of hybrid approach: let a hashing algorithm tell you that any change has been made, and use that as a trigger to retrieve an archival copy of the document for a more rigorous (Levenshtein) comparison.
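A rough sketch of that hybrid, reusing the levenshtein function from the sketch above; the cache layout, the 10% threshold, and the check_page name are all illustrative assumptions:

```python
import hashlib

def check_page(url: str, new_text: str, cache: dict, threshold: float = 0.10) -> str:
    """Cheap hash check first; fall back to edit distance only when the hash changes.

    cache maps url -> (digest, archived_text). `threshold` is the fraction of
    the page allowed to change before we call the change "major".
    """
    digest = hashlib.sha256(new_text.encode()).hexdigest()
    old = cache.get(url)
    if old and old[0] == digest:
        return "unchanged"
    if old:
        distance = levenshtein(old[1], new_text)  # from the sketch above
        ratio = distance / max(len(old[1]), len(new_text), 1)
        verdict = "major change" if ratio > threshold else "minor change"
    else:
        verdict = "new page"
    cache[url] = (digest, new_text)  # archive the latest copy for next time
    return verdict
```

The expensive Levenshtein comparison then only runs on the (presumably rare) fetches where the cheap hash check actually fires.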