Best way to check if content of page has been changed?

I have a crawler that crawls hundreds of thousands of pages and indexes/parses their content. One thing I'm struggling with is checking whether the content of a page has been updated, in an efficient way, without having to re-crawl it and compare the full content of the page.

Obviously I could just load the whole page, re-parse everything, and compare it all to what I have stored in my database. However, that is very inefficient and uses a lot of computing power, resulting in high hosting bills.

I'm thinking of comparing hashes. The problem with this is that if the page has changed by even a single byte or character, the hash will be different. For example, if the page displays the current date, the hash would be different every single time and tell me that the content has been updated.
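To make that concrete, here is a minimal sketch (with made-up page bodies) showing why a plain cryptographic hash is too strict for this:

```python
import hashlib

# Two hypothetical page bodies that differ only in a rendered date.
page_v1 = "<html><body>Hello world. Today is 2022-10-31.</body></html>"
page_v2 = "<html><body>Hello world. Today is 2022-11-01.</body></html>"

digest_v1 = hashlib.sha256(page_v1.encode("utf-8")).hexdigest()
digest_v2 = hashlib.sha256(page_v2.encode("utf-8")).hexdigest()

# A one-character difference yields a completely different digest, so a
# plain hash cannot tell a trivial change from a real content update.
print(digest_v1 == digest_v2)  # False
```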

So... how would you do this? Would you look at the size of the HTML in kilobytes? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has been "changed"? Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have been changed?
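For reference, the two heuristics mentioned above can be sketched like this (the stored and fetched strings are made up; `difflib.SequenceMatcher` is one standard-library way to get a gradual similarity score instead of an all-or-nothing hash):

```python
import difflib

# Hypothetical stored and freshly fetched versions of a page's text.
stored = "The quick brown fox jumps over the lazy dog. Updated daily."
fetched = "The quick brown fox jumps over the lazy dog. Updated today."

# Heuristic 1: flag a change only if the length differs by more than 5%.
length_changed = abs(len(fetched) - len(stored)) / len(stored) > 0.05

# Heuristic 2: a similarity ratio degrades gradually with small edits
# (1.0 means identical), unlike a cryptographic hash.
similarity = difflib.SequenceMatcher(None, stored, fetched).ratio()

print(length_changed, round(similarity, 2))
```

Both are approximations: the length check misses same-length edits, while the similarity ratio costs more CPU per comparison.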

Marcus Lind asked Oct 31 '22

1 Answer

You could try using the value of the "Last-Modified" header in the response from the server. Parsing this into a date object allows for simple comparisons, letting you check whether you should re-scrape. For example (in Python, using the brilliant requests library):

import requests
from email.utils import parsedate_to_datetime

r = requests.get('http://en.wikipedia.org/wiki/Monty_Python')

# the header may be absent, so use .get() rather than indexing
site_last_modified_date = r.headers.get("Last-Modified")

# from here, just parse the date and compare it with the last recorded date
if site_last_modified_date:
    last_modified = parsedate_to_datetime(site_last_modified_date)
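Going a step further, HTTP conditional requests let the server do the comparison for you: send the stored value back in an `If-Modified-Since` header, and a compliant server replies `304 Not Modified` with no body. A sketch along those lines (`fetch_if_modified` is a hypothetical helper, and not every server honours conditional requests):

```python
import requests

def conditional_headers(last_modified):
    # Build the request header that asks the server to skip unchanged pages.
    return {"If-Modified-Since": last_modified} if last_modified else {}

def fetch_if_modified(url, last_modified=None):
    # Hypothetical helper: returns (body, new_last_modified); body is None
    # when the server reports the page is unchanged.
    r = requests.get(url, headers=conditional_headers(last_modified))
    if r.status_code == 304:  # Not Modified: keep the stored copy
        return None, last_modified
    return r.text, r.headers.get("Last-Modified")
```

On a 304 response no page body is transferred at all, which directly cuts the bandwidth and parsing cost the question is worried about.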
iainjames9 answered Nov 11 '22