I have snapshots of multiple webpages taken at two different times. What is a reliable method to determine which webpages have been modified?
I can't rely on something like an RSS feed, and I need to ignore minor noise like date text.
Ideally I am looking for a Python solution, but an intuitive algorithm would also be great.
Thanks!
Right-click the file and select Properties. The Properties window displays the Created, Modified, and Accessed dates.
Go to the Wayback Machine website. Type or paste the desired URL into the search box and click the Browse History button or press Enter. If the search succeeds, you will see how many times the Wayback Machine saved the page and when, with colored dots marking the days on which snapshots were taken.
Well, first you need to decide what is noise and what isn't. You can use an HTML parser like BeautifulSoup to remove the noise, pretty-print the result, and compare it as a string.
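As a rough illustration of the "remove the noise, then compare as a string" idea, here is a minimal sketch. To keep it dependency-free it uses the standard library's html.parser.HTMLParser in place of BeautifulSoup; the same approach should carry over to BeautifulSoup with decompose() and get_text(). The names NOISE_TAGS, DATE_RE, TextExtractor and normalize_page are made up for this example, and both the tag list and the date pattern are assumptions you would tune per site.

```python
import re
from html.parser import HTMLParser

# Assumption: which tags count as noise is a per-site decision.
NOISE_TAGS = {"script", "style", "noscript"}
# Assumption: dates look like 2024-01-31 or 31/01/2024; adjust for your pages.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside noise tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a noise tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def normalize_page(html):
    """Return the page text with noise tags, dates and extra whitespace removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    text = DATE_RE.sub("", text)
    return " ".join(text.split())
```

Two snapshots can then be treated as unchanged when normalize_page(old_html) == normalize_page(new_html). An exact comparison like this is strict, so any noise that the filters miss will show up as a modification.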
If you are looking for an automatic solution, you can use difflib.SequenceMatcher to compute a similarity ratio between the two versions of a page and compare it to a threshold.
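A minimal sketch of the threshold approach might look like the following. The helper names similarity and has_changed, the word-level comparison, and the 0.95 default threshold are all assumptions made for illustration; the right threshold depends on how noisy your pages are and is best picked by looking at a few known-changed and known-unchanged pairs.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Return a similarity ratio in [0, 1] between two page texts."""
    # Comparing word lists instead of raw characters is an assumption of this
    # sketch; it keeps long pages cheaper to compare.
    # autojunk=False disables difflib's "popular element" heuristic, which
    # would otherwise ignore very frequent words in long sequences.
    matcher = SequenceMatcher(None, old_text.split(), new_text.split(), autojunk=False)
    return matcher.ratio()


def has_changed(old_text, new_text, threshold=0.95):
    """Flag a page as modified when similarity falls below the threshold."""
    return similarity(old_text, new_text) < threshold
```

Here old_text and new_text are assumed to be the page contents after whatever noise removal you apply. Running has_changed over each pair of snapshots gives the list of modified pages, and raising or lowering threshold trades false alarms against missed edits.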