 

How to check if the content of a webpage has changed?

Basically, I'm trying to run some code (Python 2.7) when the content on a website changes; otherwise, wait a bit and check again later.

I'm thinking of comparing hashes. The problem with this is that if the page has changed by even a single byte or character, the hash would be different. For example, if the page displays the current date, the hash would be different every single time and would tell me that the content has been updated.

So... how would you do this? Would you look at the size of the HTML in KB? Would you look at the string length and, for example, decide that the content has "changed" if the length has changed by more than 5%? Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have changed?

About Last-Modified: unfortunately, not all servers return this date correctly, so I don't think it is a reliable solution. I think a better way is to combine the hash and content-length approaches: check the hash, and if it has changed, check the string length.

Savad KP asked Nov 04 '15




3 Answers

Safest solution:

Download the content, compute a SHA-512 checksum of it, keep the checksum in the database, and compare it each time.

Pros: You are not dependent on any server headers and will detect any modification.
Cons: Heavy bandwidth usage; you have to download all the content every time.
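
A minimal sketch of this approach, assuming the third-party requests library (available for the asker's Python 2.7 as well as Python 3); the previous checksum is kept in a plain file here instead of a database, purely for illustration:

```python
import hashlib
import os

import requests

HASH_FILE = 'page.sha512'  # hypothetical storage location for the last checksum

def page_changed(url):
    body = requests.get(url).content                 # raw bytes of the page
    new_hash = hashlib.sha512(body).hexdigest()

    old_hash = None
    if os.path.exists(HASH_FILE):
        with open(HASH_FILE) as f:
            old_hash = f.read().strip()

    with open(HASH_FILE, 'w') as f:                  # remember the checksum for next run
        f.write(new_hash)

    return new_hash != old_hash
```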

Using HEAD

Request the page using the HEAD verb and check the response headers:

  • Last-Modified: the server should report the last time the page was generated or modified.
  • ETag: a checksum-like value, defined by the server, that should change as soon as the content changes.

Pros: Much less bandwidth usage and a very quick check.
Cons: Not all servers provide or honor these headers. If you detect a change and need the data, you still have to fetch the real resource with a GET request.
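
A rough sketch of the HEAD-based check, again assuming requests; last_seen stands in for whatever ETag / Last-Modified values you stored from the previous run:

```python
import requests

def headers_changed(url, last_seen):
    """last_seen is a dict like {'etag': ..., 'last_modified': ...}."""
    resp = requests.head(url, allow_redirects=True)
    etag = resp.headers.get('ETag')
    last_modified = resp.headers.get('Last-Modified')

    if etag is not None:
        return etag != last_seen.get('etag')
    if last_modified is not None:
        return last_modified != last_seen.get('last_modified')
    return None  # the server gave us nothing useful; fall back to hashing
```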

Using GET

Request the page using the GET verb with conditional headers:

  • If-Modified-Since: the server checks whether the resource has been modified since the given time and returns either the content or 304 Not Modified.

Pros: Still uses less bandwidth, and only a single round trip to receive the data.
Cons: Again, not all resources support this header.
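
A sketch of a conditional GET, assuming requests; If-None-Match is the ETag counterpart of If-Modified-Since:

```python
import requests

def fetch_if_changed(url, last_modified=None, etag=None):
    headers = {}
    if last_modified:
        headers['If-Modified-Since'] = last_modified
    if etag:
        headers['If-None-Match'] = etag

    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:
        return None                                  # unchanged, no body transferred

    # changed (or the server ignored the conditional headers):
    # return the new content plus the validators to send next time
    return resp.content, resp.headers.get('Last-Modified'), resp.headers.get('ETag')
```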

Finally, a mix of the above solutions is probably the optimal way to do this.

Ali Nikneshan answered Sep 25 '22


There is no universal solution.

  • Use If-Modified-Since or HEAD requests when possible (usually ignored by dynamic pages).
  • Use RSS when possible.
  • Extract the last-modification stamp in a site-specific way (news sites have publication dates for each article, easily extractable via XPath).
  • Only hash the interesting elements of the page (build a site-specific model), excluding volatile parts; see the sketch after this list.
  • Hash the whole content (useless for dynamic pages).
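
A sketch of the "hash only the interesting elements" idea, assuming the third-party lxml library; the XPath below is hypothetical and would have to be written per site (e.g. the main article container):

```python
import hashlib

from lxml import html

def content_fingerprint(page_bytes, xpath='//div[@id="content"]'):
    # hash only the nodes matched by the site-specific XPath,
    # ignoring everything else (ads, timestamps, navigation, ...)
    tree = html.fromstring(page_bytes)
    parts = [html.tostring(node) for node in tree.xpath(xpath)]
    return hashlib.sha512(b''.join(parts)).hexdigest()
```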
Basilevs answered Sep 22 '22


If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.

Would you look at the size of the HTML in KB? Would you look at the string length and, for example, decide that the content has "changed" if the length has changed by more than 5%?

That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.

Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have changed?

You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
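
One possible (and admittedly crude) coding of that byte-frequency-ordering idea, offered as a sketch rather than a standard algorithm: the "hash" is the list of byte values sorted by frequency, and the distance between two such orderings gives a rough measure of how much the document's make-up has shifted:

```python
from collections import Counter

def frequency_order(data):
    # all 256 byte values, most frequent first (ties broken by value)
    counts = Counter(bytearray(data))
    return sorted(range(256), key=lambda b: (-counts.get(b, 0), b))

def order_distance(order_a, order_b):
    # total displacement of each byte value between the two orderings;
    # a larger value means the byte-frequency profile moved more
    pos_b = {b: i for i, b in enumerate(order_b)}
    return sum(abs(i - pos_b[b]) for i, b in enumerate(order_a))
```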

An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
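
A sketch of the slice-hashing idea; the split here is a naive per-paragraph split via a regular expression rather than the header/heading/paragraph model described above, and the thresholds simply follow the text (around 30 slices, 2 allowed changes):

```python
import hashlib
import re

def paragraph_hashes(html_text):
    # one hash per <p>...</p> block; expects the decoded HTML as a text string
    paragraphs = re.findall(r'<p\b.*?</p>', html_text, re.DOTALL | re.IGNORECASE)
    return [hashlib.md5(p.encode('utf-8')).hexdigest() for p in paragraphs]

def roughly_same(old_hashes, new_hashes, allowed_changes=2):
    # treat added/removed slices as changes too
    changed = sum(1 for a, b in zip(old_hashes, new_hashes) if a != b)
    changed += abs(len(old_hashes) - len(new_hashes))
    return changed <= allowed_changes
```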

You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
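
For instance, a couple of illustrative (far from exhaustive) normalisation patterns applied before hashing:

```python
import re

def normalise(text):
    # mask times like 14:03 or 14:03:59 and ISO dates like 2015-11-04
    text = re.sub(r'\b\d{1,2}:\d{2}(?::\d{2})?\b', '<time>', text)
    text = re.sub(r'\b\d{4}-\d{2}-\d{2}\b', '<date>', text)
    return text
```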

You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.

Tony Delroy answered Sep 24 '22