
How to determine if two web pages are the same?

Tags: algorithm, diff

What are some techniques for detecting whether one web page is the same as another?

By "same", I don't mean char-for-char equivalent (that's easy), but something robust enough to ignore things like the current date/time on the page, etc.

E.g., go to a Yahoo! News article and load the page, then open the same page 10 minutes later in another browser. Barring rewrites, those pages will have some differences (timestamps, possibly ads, possibly related stories), but a human could look at the two and say they're the same.

Note that I'm not trying to handle (or rely on) URL normalization, i.e., figuring out that foo.html and foo.html?bar=bang are the same.

asked Jan 19 '09 by Bill

3 Answers

It sounds like you are after a robust way to measure the similarity of two pages.

Given that the structure of the page won't change that much, we can reduce the problem to testing whether the text on the page is roughly the same. Of course, with this approach the problems alluded to by nickf regarding a photographer's page are still there, but if you are mainly concerned with Yahoo! News or the like this should be okay.

To compare two pages, you can use a method from machine learning called "string kernels". Here's an early paper, a recent set of slides on an R package, and a video lecture.

Very roughly, a string kernel looks at how many words, pairs of words, triples of words, etc., two documents have in common. If A and B are two documents and k is a string kernel, then the higher the value of k(A,B), the more similar the two documents are.

If you set a threshold t and only say two documents are the same when k(A,B) > t, you should have a reasonably good way of doing what you want. Of course, you'll have to tune the threshold to get the best results for your application.
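To make the idea concrete, here is a minimal sketch of a word n-gram "spectrum"-style kernel, assuming the pages have already been reduced to plain text. The function names, max_n, and the threshold value are illustrative choices, not from the answer.

```python
from collections import Counter

def ngram_counts(text, max_n=3):
    """Count all word n-grams of length 1..max_n in the text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def string_kernel(a, b, max_n=3):
    """k(A, B): sum over shared n-grams of the product of their counts."""
    ca, cb = ngram_counts(a, max_n), ngram_counts(b, max_n)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())

def same_page(a, b, threshold=0.8, max_n=3):
    """Normalise k(A, B) by sqrt(k(A, A) * k(B, B)), then compare to a threshold."""
    kab = string_kernel(a, b, max_n)
    norm = (string_kernel(a, a, max_n) * string_kernel(b, b, max_n)) ** 0.5
    return norm > 0 and kab / norm > threshold
```

Normalising by the self-similarity of each document keeps the score in a fixed range, which makes the threshold t easier to tune.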

— Mark Reid


For this kind of problem I find that a search through academic papers is much better than asking Stack Overflow; when dealing with specifics, the experts are often much smarter than the crowd.

Every web crawler or search engine has this problem and has solved it. There is probably a good approach using a kernel-based method, as the accepted answer suggests, but you probably want to start with simpler techniques that are known to work well. You can move to kernel methods afterwards and test whether they improve your results.

Your best bet is to read Henzinger's 2006 paper 'Finding near-duplicate web pages: a large-scale evaluation of algorithms'.

You'll probably be looking at generating a Rabin fingerprint as a first step; see 'Fingerprinting by random polynomials' (Rabin, 1986).
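As a rough sketch of the flavour of technique Henzinger evaluates, one common approach is to break the page text into overlapping word "shingles", fingerprint each shingle, and compare small fixed-size sketches of those fingerprints. In the sketch below, Python's built-in hash stands in for a proper Rabin fingerprint, and the shingle size, sketch size, and threshold are illustrative guesses.

```python
def shingles(text, k=5):
    """All overlapping word k-grams ("shingles") of the page text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def sketch(text, k=5, m=64):
    """Keep the m smallest shingle hashes as a fixed-size sketch of the page."""
    return set(sorted(hash(s) for s in shingles(text, k))[:m])

def near_duplicate(a, b, threshold=0.9, k=5, m=64):
    """Estimate resemblance via the Jaccard overlap of the two sketches."""
    sa, sb = sketch(a, k, m), sketch(b, k, m)
    if not sa or not sb:
        return a == b
    return len(sa & sb) / len(sa | sb) >= threshold
```

The advantage over comparing full texts is that each page is reduced to a small, fixed-size sketch that can be stored and compared cheaply across millions of pages.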

— Jesse Sherlock


You can detect that two pages are the same by using a similarity metric such as cosine similarity. You would then have to define a minimum threshold for accepting that the two documents are the same. For example, I would pick a value close to 1 when applying the cosine measure, since it ranges from -1 for totally different to 1 for identical.
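A small sketch of this approach, using plain word counts as the document vectors; a real system would likely strip markup first and use TF-IDF weights instead of raw counts. The threshold value here is only an example.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two pages' word-count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pages_match(text_a, text_b, threshold=0.95):
    """Treat the pages as the same when the similarity clears the threshold."""
    return cosine_similarity(text_a, text_b) >= threshold
```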

— Marcel