What are some techniques for detecting whether a webpage is the same as another?
By same, I don't mean char-for-char equivalent (that's easy), but something robust enough to ignore things like the current date/time on the page, etc.
E.g., go to a Yahoo! News article, load the page, then open the same page 10 minutes later in another browser. Barring rewrites, those pages will have some differences (timestamps, possibly things like ads, possibly things like related stories), but a human could look at the two and say they're the same.
Note I'm not trying to fix (or rely on) URL normalization, i.e., figuring out that foo.html & foo.html?bar=bang are the same.
It sounds like you are after a robust way to measure the similarity of two pages.
Given that the structure of the page won't change that much, we can reduce the problem to testing whether the text on the page is roughly the same. Of course, with this approach the problems alluded to by nickf regarding a photographer's page are still there, but if you are mainly concerned with Yahoo! News or the like, this should be okay.
To compare two pages, you can use a method from machine learning called "string kernels". Here's an early paper, a recent set of slides on an R package, and a video lecture.
Very roughly, a string kernel looks at how many words, pairs of words, triples of words, etc. two documents have in common. If A and B are two documents and k is a string kernel, then the higher the value of k(A,B), the more similar the two documents are.
If you set a threshold t and only say two documents are the same when k(A,B) > t, you should have a reasonably good way of doing what you want. Of course, you'll have to tune the threshold to get the best results for your application.
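As a rough illustration (not the more sophisticated kernels from the papers above), here is a minimal word-level n-gram "spectrum" kernel in Python; it assumes the page text has already been extracted, and the max_n=3 and 0.8 threshold are arbitrary starting points you would tune:

```python
from collections import Counter

def ngram_counts(text, max_n=3):
    """Count all word n-grams of length 1..max_n in the text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def string_kernel(a, b, max_n=3):
    """Normalized spectrum kernel: 1.0 for identical texts, ~0 for unrelated."""
    ca, cb = ngram_counts(a, max_n), ngram_counts(b, max_n)
    k_ab = sum(ca[g] * cb[g] for g in ca if g in cb)   # dot product in n-gram space
    k_aa = sum(v * v for v in ca.values())
    k_bb = sum(v * v for v in cb.values())
    return k_ab / ((k_aa * k_bb) ** 0.5) if k_aa and k_bb else 0.0

doc_a = "Stocks rose sharply on Monday as investors cheered the earnings report."
doc_b = "Stocks rose sharply Monday, with investors cheering the earnings report."

THRESHOLD = 0.8  # tune this for your data
print(string_kernel(doc_a, doc_b) > THRESHOLD)
```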
For this kind of problem I find a search through academic papers to be much better than asking StackOverflow; when dealing with specifics, the experts are often much smarter than the crowd.
Every webcrawler or search engine has this problem and has solved it. There is probably a good approach using a kernel-based method like the accepted answer suggests, but you probably want to start with simpler techniques that are known to work well. You can move to kernel methods afterwards and test whether they improve your results.
Your best bet is to read Henzinger's 2006 paper 'Finding near-duplicate web pages: a large scale evaluation of algorithms',
and you'll probably be looking at generating a Rabin fingerprint as a first step, per 'Fingerprinting by random polynomials' (Rabin, 1986).
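To make the idea concrete, here is a hedged sketch of the shingling style of approach evaluated in Henzinger's paper: hash every w-word window ("shingle") of the extracted page text and compare the resulting sets with Jaccard similarity. A production system would use Rabin fingerprints for the hashing step; Python's hashlib is used here only as a stand-in, and w=5 is an arbitrary choice:

```python
import hashlib

def shingle_hashes(text, w=5):
    """Hash every w-word shingle of the text (md5 stands in for a Rabin fingerprint)."""
    words = text.lower().split()
    shingles = set()
    for i in range(max(len(words) - w + 1, 1)):
        shingle = " ".join(words[i:i + w])
        shingles.add(hashlib.md5(shingle.encode("utf-8")).hexdigest())
    return shingles

def jaccard_similarity(text_a, text_b, w=5):
    """Fraction of shingles the two pages share; near 1.0 means near-duplicate."""
    sa, sb = shingle_hashes(text_a, w), shingle_hashes(text_b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```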
You can detect that two pages are the same by using some sort of similarity metric such as cosine similarity. Then you would have to define a minimum threshold that you can use to accept whether the two documents are the same. For example, I would pick a value close to 1 when applying the cosine measure, since it ranges from -1 for completely opposite documents to 1 for identical ones (with non-negative term-frequency vectors it stays between 0 and 1).
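A minimal sketch of this, assuming plain term-frequency vectors over the extracted page text; the 0.9 threshold is an arbitrary starting point to tune against your own data:

```python
from collections import Counter
import math

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pages_match(text_a, text_b, threshold=0.9):
    return cosine_similarity(text_a, text_b) >= threshold
```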