
How to determine if a webpage has been modified

I have snapshots of multiple webpages taken at two different times. What is a reliable method to determine which webpages have been modified?

I can't rely on something like an RSS feed, and I need to ignore minor noise like date text.

Ideally I am looking for a Python solution, but an intuitive algorithm would also be great.

Thanks!

hoju asked Oct 19 '09 10:10




1 Answer

Well, first you need to decide what is noise and what isn't. You can use an HTML parser like BeautifulSoup to remove the noise, pretty-print the result, and compare the two versions as strings.
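As a rough sketch of that idea, the snippet below strips markup, drops date-like text, and compares what is left. To stay dependency-free it uses the standard library's html.parser rather than BeautifulSoup, and it compares collapsed visible text instead of pretty-printed markup. The DATE_RE pattern and the names normalize and is_modified are illustrative assumptions, not part of the original answer; what counts as noise depends on your pages.

```python
import re
from html.parser import HTMLParser

# Assumed noise pattern: ISO-style dates such as 2009-10-19.
# Adjust this to whatever counts as noise on your pages.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def normalize(html):
    """Return the page's visible text with dates removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = DATE_RE.sub("", " ".join(parser.parts))
    return " ".join(text.split())


def is_modified(old_html, new_html):
    """True if the two snapshots differ after noise removal."""
    return normalize(old_html) != normalize(new_html)
```

An exact string comparison like this flags any remaining difference, so it only works well when the noise rules cover everything you want to ignore.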

If you are looking for an automatic solution, you can use difflib.SequenceMatcher to compute how similar the two versions of a page are and compare that similarity to a threshold.
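A minimal sketch of the threshold approach might look like the following. The function names and the 0.95 default threshold are assumptions for illustration; a suitable threshold has to be tuned on your own snapshots. The inputs are plain strings, which may be raw HTML or text you have already cleaned.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Return a ratio in [0, 1]; 1.0 means the word sequences are identical."""
    # Comparing word lists instead of characters is faster on long pages
    # and less sensitive to whitespace changes.
    matcher = SequenceMatcher(None, old_text.split(), new_text.split(), autojunk=False)
    return matcher.ratio()


def has_changed(old_text, new_text, threshold=0.95):
    """Treat a page as modified when similarity drops below the threshold."""
    return similarity(old_text, new_text) < threshold
```

For example, has_changed(old_page, new_page, threshold=0.9) would tolerate small edits such as a changed timestamp while still flagging larger rewrites.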

Lukáš Lalinský answered Sep 19 '22 12:09