How to check if content of webpage has been changed?

Tags:

compare

hash

python-2.7

web-crawler

Basically I'm trying to run some code (Python 2.7) if the content on a website changes, otherwise wait for a bit and check it later.

I'm thinking of comparing hashes. The problem is that if even a single byte or character of the page changes, the hash will be different. For example, if the page displays the current date, the hash would differ on every check and tell me that the content has been updated.

So... how would you do this? Would you look at the size of the HTML in KB? Would you look at the string length and treat the content as "changed" only if, for example, the length has changed by more than 5%? Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have changed?

About Last-Modified: unfortunately, not all servers return this date correctly, so I don't think it is a reliable solution. I think a better way is to combine the hash and content-length approaches: check the hash, and if it has changed, check the string length.

330

asked Nov 04 '15 07:11

Savad KP

3 Answers

Safest solution:

Download the content, compute a SHA-512 hash of it, store the hash in a database, and compare it against the stored value on each check.

Pros: You do not depend on any server headers, and you will detect any modification.
Cons: High bandwidth usage, because you have to download the full content every time.
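
A minimal sketch of this approach, assuming the page body has already been downloaded as bytes. The names `content_digest` and `has_changed` are hypothetical, and a plain dict stands in for the database:

```python
import hashlib


def content_digest(body):
    """Return the SHA-512 hex digest of the raw page bytes."""
    return hashlib.sha512(body).hexdigest()


def has_changed(key, body, store):
    """Compare the digest of `body` with the one saved under `key`.

    `store` is any dict-like object standing in for the database
    (an assumption for this sketch). The new digest is always saved,
    so the next call compares against the latest download.
    """
    new_digest = content_digest(body)
    old_digest = store.get(key)
    store[key] = new_digest
    # The first time a page is seen there is no baseline, so report "unchanged".
    return old_digest is not None and old_digest != new_digest
```

Calling `has_changed(url, body, store)` after every download reports `True` only when a previously stored digest differs from the new one.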

Using `HEAD`

Request the page with the HEAD method and check these response headers:

Last-Modified: The server should report the last time the page was generated or modified.
ETag: A checksum-like value defined by the server that should change whenever the content changes.

Pros: Much less bandwidth usage and very quick checks.
Cons: Not all servers provide these headers or follow the guidelines for them. If the headers indicate a change, you still need a separate GET request to fetch the actual resource.
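
One way this could look with only the standard library. The function names are hypothetical, and the import fallback is there so the sketch runs on both Python 2.7 and Python 3. Since servers may omit both headers, `validators_changed` returns `None` when there is nothing to compare:

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2.7


def fetch_validators(url, timeout=10):
    """Send a HEAD request and return the ETag and Last-Modified values."""
    request = Request(url)
    # Overriding get_method switches the request to HEAD on 2.7 and 3 alike.
    request.get_method = lambda: "HEAD"
    response = urlopen(request, timeout=timeout)
    try:
        headers = response.info()
        return {
            "etag": headers.get("ETag"),
            "last_modified": headers.get("Last-Modified"),
        }
    finally:
        response.close()


def validators_changed(old, new):
    """Return True or False, or None if no header is present in both."""
    # ETag is checked first; Last-Modified is the fallback.
    for name in ("etag", "last_modified"):
        if old.get(name) and new.get(name):
            return old[name] != new[name]
    return None
```

A `None` result is the signal to fall back to downloading and hashing the content.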

Using `GET`

Request the page with the GET method and a conditional request header:

If-Modified-Since: The server checks whether the resource has been modified since the given time and either returns the content or responds with 304 Not Modified.

Pros: Still uses less bandwidth, and a single round trip retrieves the data when it has changed.
Cons: Again, not all resources support this header.
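
A rough sketch of a conditional GET using the standard library. The function names are hypothetical. `If-None-Match` is not mentioned above; it is the standard HTTP counterpart of `ETag` and is included here as an assumption about what you may want to send. Note that `urlopen` reports a 304 response by raising `HTTPError`:

```python
try:
    from urllib.request import Request, urlopen  # Python 3
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import Request, urlopen, HTTPError  # Python 2.7


def conditional_headers(last_modified=None, etag=None):
    """Build the conditional request headers from previously saved values."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch_if_modified(url, last_modified=None, etag=None, timeout=10):
    """Return (body, validators); body is None when the server answers 304."""
    request = Request(url, headers=conditional_headers(last_modified, etag))
    try:
        response = urlopen(request, timeout=timeout)
    except HTTPError as error:
        if error.code == 304:
            # Not modified: keep the validators we already had.
            return None, {"last_modified": last_modified, "etag": etag}
        raise
    try:
        info = response.info()
        return response.read(), {
            "last_modified": info.get("Last-Modified"),
            "etag": info.get("ETag"),
        }
    finally:
        response.close()
```

Save the returned validators alongside the page and pass them back in on the next check. A server that ignores the headers simply returns the full body each time.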

Finally, a mix of the approaches above may be the best way to do this.

199

answered Sep 25 '22 05:09

Ali Nikneshan

There is no universal solution.

Use If-Modified-Since or HEAD when possible (usually ignored by dynamic pages).
Use RSS when possible.
Extract the last-modification timestamp in a site-specific way (news sites have publication dates for each article, easily extractable via XPath).
Hash only the interesting elements of the page (build a site-specific model), excluding volatile parts.
Hash the whole content (useless for dynamic pages).
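
As an illustration of hashing only the interesting part of a page, here is a sketch that hashes the text inside a single element chosen by its `id`. The class and function names are hypothetical, the `id` is whatever your site-specific model picks, and the parser assumes reasonably well-formed markup (every non-void element is closed):

```python
import hashlib

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = frozenset(["area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"])


class ElementText(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, element_id):
        HTMLParser.__init__(self)
        self.element_id = element_id
        self.depth = 0  # 0 means "outside the target element"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def element_digest(html, element_id):
    """Hash the whitespace-normalised text of one element of decoded HTML."""
    parser = ElementText(element_id)
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha512(text.encode("utf-8")).hexdigest()
```

With this, a clock or advert elsewhere on the page no longer affects the digest; only changes inside the chosen element do.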

answered Sep 22 '22 05:09

Basilevs

If you're trying to make a tool that can be applied to arbitrary sites, then you could still start by getting it working for a few specific ones - downloading them repeatedly and identifying exact differences you'd like to ignore, trying to deal with the issues reasonably generically without ignoring meaningful differences. Such a quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against increasing numbers of sites and tweak as you go.

> Would you look at the size of the HTML in KB? Would you look at the string length and treat the content as "changed" only if, for example, the length has changed by more than 5%?

That's incredibly rough, and I'd avoid that if at all possible. But, you do need to weigh up the costs of mistakenly deeming a page unchanged vs. mistakenly deeming it changed.

> Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have changed?

You can make such a "hash", but it's very hard to tune the sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and consider that a 2k hash: you can later do a "diff" to see how much that byte value ordering's changed in a later download. (To save memory, you might get away with doing just the printable ASCII values, or even just letters after standardising capitalisation).
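
A small sketch of the byte-frequency idea. The function names are hypothetical, and the "diff" used here (the total distance each byte value moved between the two orderings) is just one possible choice, not something the idea prescribes:

```python
from collections import Counter


def byte_order_signature(body):
    """Return the 256 byte values ordered from most to least frequent."""
    counts = Counter(bytearray(body))
    # Ties are broken by byte value so the ordering is deterministic.
    return sorted(range(256), key=lambda value: (-counts[value], value))


def signature_distance(old, new):
    """Sum of how far each byte value moved between two orderings."""
    old_rank = {value: rank for rank, value in enumerate(old)}
    return sum(abs(old_rank[value] - rank) for rank, value in enumerate(new))
```

Identical orderings give a distance of 0. Where to set the threshold for "changed" still has to be tuned against real pages.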

An alternative is to generate a set of hashes for different slices of the document: e.g. dividing it into header vs. body, body by heading levels then paragraphs, until you've got at least a desired level of granularity (e.g. 30 slices). You can then say that if only 2 slices of 30 have changed you'll consider the document the same.
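
A sketch of the slice idea, under the simplifying assumption that slices start at heading and paragraph tags (a real page would likely need a proper HTML parser). All names here are hypothetical:

```python
import hashlib
import re
from collections import Counter

# Assumed slice boundaries: the start of a heading or paragraph tag.
SLICE_BOUNDARY = re.compile(r"<(?:h[1-6]|p)\b", re.IGNORECASE)


def slice_digests(html):
    """Split decoded HTML into slices and hash each non-empty slice."""
    slices = [part.strip() for part in SLICE_BOUNDARY.split(html)]
    return [hashlib.sha512(part.encode("utf-8")).hexdigest()
            for part in slices if part]


def changed_slices(old_digests, new_digests):
    """Count slices that were added or removed, whichever number is larger."""
    old_counts, new_counts = Counter(old_digests), Counter(new_digests)
    added = sum((new_counts - old_counts).values())
    removed = sum((old_counts - new_counts).values())
    return max(added, removed)


def roughly_same(old_digests, new_digests, max_changed=2):
    """Treat the document as unchanged if at most max_changed slices differ."""
    return changed_slices(old_digests, new_digests) <= max_changed
```

Comparing the digests as multisets rather than by position means that inserting one paragraph near the top does not make every later slice count as changed.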

You might also try replacing certain types of content before hashing - e.g. use regular expression matching to replace times with "<time>".
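
For example, something along these lines. The patterns are assumptions covering only `HH:MM[:SS]` times and ISO-style dates, and the `<date>` placeholder and function name are made up for the sketch; real pages will need patterns tuned to what they actually print:

```python
import hashlib
import re

# Assumed formats: 7:11, 07:11 or 07:11:32, and dates like 2015-11-04.
TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def normalised_digest(text):
    """Replace volatile dates and times with placeholders, then hash."""
    text = DATE_PATTERN.sub("<date>", text)
    text = TIME_PATTERN.sub("<time>", text)
    return hashlib.sha512(text.encode("utf-8")).hexdigest()
```

Two downloads that differ only in a generated timestamp then produce the same digest.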

You could also do things like lower the tolerance to change more as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
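
One way to sketch that idea: let the allowed fraction of changed slices shrink linearly to zero over a maximum age. The 10% starting tolerance, the one-day window, and the function names are arbitrary assumptions for illustration:

```python
def allowed_change(seconds_since_processed, initial=0.10, max_age=86400.0):
    """Fraction of the page allowed to differ; falls linearly to 0 at max_age."""
    remaining = max(0.0, 1.0 - seconds_since_processed / float(max_age))
    return initial * remaining


def should_process(changed_fraction, seconds_since_processed):
    """Process the page once the observed change exceeds the current tolerance."""
    return changed_fraction > allowed_change(seconds_since_processed)
```

Once `max_age` has passed, any detected change at all triggers processing, which caps how long a wrongly ignored update can go unnoticed.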

answered Sep 24 '22 05:09

Tony Delroy
