 

Is there an easy way to script a comparison of a web page over time?

I have a website that I want to monitor for changes, specifically in one DIV in the HTML. I was using http://www.followthatpage.com/ to monitor a webpage for changes, but I ran into two issues:

  1. It checks the whole site, not just one DIV
  2. It only checks the site once per hour

Ideally, I would like to write a bash or python script that does a diff of two files every 15 minutes, and emails any changes. I was thinking I might be able to use the diff command after downloading two files, and set it up for a cron to email if there are changes, but I still don't know how to filter only to a specific DIV.

Is there an easier way than figuring out how to do this myself (an existing script)? If not, what would be the best method to do this?

asked Dec 20 '25 by EGr


2 Answers

Another way to do it, if you have access to a Linux terminal, is to add a cron job:

$ crontab -e

and place the following line (runs every day at 16:00):

0   16   *   *   *   diff_web_page.sh
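The schedule above runs once a day; the asker wanted a check every 15 minutes, which cron expresses with a step value:

```shell
# every 15 minutes, on the quarter hour
*/15 * * * * diff_web_page.sh
```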

where contents of diff_web_page.sh are

#!/bin/bash

URL="http://linux.die.net/man/1/bash"
TMP_FILE="/tmp/diff_page.txt"
if [[ ! -f $TMP_FILE ]]; then
    # First run: just save the page, there is nothing to compare yet.
    lynx -dump "$URL" > "$TMP_FILE"
    # To keep only a div: lynx -source "$URL" | pcregrep -M "<div>.*</div>" > "$TMP_FILE"
else
    # The file exists: grab the new version and compare it.
    lynx -dump "$URL" > "$TMP_FILE.new"  # filter with pcregrep here too if needed
    diff -Npaur "$TMP_FILE" "$TMP_FILE.new"
    mv "$TMP_FILE.new" "$TMP_FILE"
fi

This will email the diff of the web page to user@host (on the Linux box running the cron job) each time the page has changed, since cron mails any output the job produces and diff only produces output when the files differ.

If you want a specific div, you can filter the output with pcregrep -M when dumping the web page (use lynx -source to get the raw HTML, since lynx -dump emits rendered text without tags).

answered Dec 22 '25 by user2599522


Since the div you want is specific to the site, you will probably have to set up a simple check.

This consists of

  • Downloading the HTML - urllib.request.urlopen(URL) or requests.get(URL).
  • Extracting just the right section (BeautifulSoup, roll your own)
  • Performing your comparison (straight comparison or difflib).

Figuring out what and how to extract the data is going to take you the longest time. I recommend using Developer Tools in Chrome/Firefox.

Let's say we want to know when the counter updates on digitalocean.com. The div for the counter looks like this:

<div class='inner'>
<span class='count'>5</span>
<span class='count'>8</span>
<span class='count'>2</span>
<span class='count_delimiter'>,</span>
<span class='count'>4</span>
<span class='count'>1</span>
<span class='count'>7</span>
</div>

Sadly, there's no id, which would make it really easy to pull out using BeautifulSoup4 (e.g. soup.find(id="counter")).

Instead, I would elect to pull out all the inner elements that have class "count".

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.digitalocean.com')
soup = BeautifulSoup(resp.text, 'html.parser')
digits = [tag.getText() for tag in soup.find_all(class_="count")]
count = int(''.join(digits))

BeautifulSoup has excellent documentation for parsing HTML documents without having to bang your head (depending on how well the site you're scraping is laid out).

answered Dec 22 '25 by Kyle Kelley


