 

Is there an easy way to script a comparison of a web page over time?

I have a website that I want to monitor for changes, specifically in one DIV in the HTML. I was using http://www.followthatpage.com/ to monitor a webpage for changes, but I ran into two issues:

  1. It checks the whole site, not just one DIV
  2. It only checks the site once per hour

Ideally, I would like to write a bash or python script that does a diff of two files every 15 minutes, and emails any changes. I was thinking I might be able to use the diff command after downloading two files, and set it up for a cron to email if there are changes, but I still don't know how to filter only to a specific DIV.

Is there an easier way than figuring out how to do this myself (an existing script)? If not, what would be the best method to do this?

asked Dec 20 '25 by EGr


2 Answers

Another way to do it, if you have access to a Linux terminal, is to add a cron job:

$ crontab -e

and place the following line (runs every day at 16:00):

0   16   *   *   *   diff_web_page.sh
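The schedule above runs once a day; the asker wanted a check every 15 minutes, which cron expresses with a step value:

```shell
# every 15 minutes, on the quarter hour
*/15 * * * * diff_web_page.sh
```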

where contents of diff_web_page.sh are

#!/bin/bash

URL="http://linux.die.net/man/1/bash"
TMP_FILE="/tmp/diff_page.txt"
if [[ ! -f $TMP_FILE ]]; then
    # First run: just save the page, there is nothing to compare yet.
    lynx -dump "$URL" > "$TMP_FILE"
    # To keep only a div: lynx -source "$URL" | pcregrep -M "<div>.*</div>" > "$TMP_FILE"
else
    # The file exists: grab the new version and compare it.
    lynx -dump "$URL" > "$TMP_FILE.new"  # filter with pcregrep here too if needed
    diff -Npaur "$TMP_FILE" "$TMP_FILE.new"
    mv "$TMP_FILE.new" "$TMP_FILE"
fi

This will email the diff of the web page to user@host (on the Linux box running the cron job) each time the page has changed, since cron mails any output the job produces and diff only produces output when the files differ.

If you want a specific div, you can filter the output with pcregrep -M when dumping the web page (use lynx -source to get the raw HTML, since lynx -dump emits rendered text without tags).

answered Dec 22 '25 by user2599522


Since the div you want is specific to the site, you will probably have to set up a simple check.

This consists of

  • Downloading the HTML - urllib.request.urlopen(URL) or requests.get(URL).
  • Extracting just the right section (BeautifulSoup, roll your own)
  • Performing your comparison (straight comparison or difflib).

Figuring out what and how to extract the data is going to take you the longest time. I recommend using Developer Tools in Chrome/Firefox.

Let's say we want to know when the counter updates on digitalocean.com. The div for the counter looks like this:

<div class='inner'>
<span class='count'>5</span>
<span class='count'>8</span>
<span class='count'>2</span>
<span class='count_delimiter'>,</span>
<span class='count'>4</span>
<span class='count'>1</span>
<span class='count'>7</span>
</div>

Sadly, there's no id, which would make it really easy to pull out using BeautifulSoup4 (e.g. soup.find(id="counter")).

Instead, I would elect to pull out all the inner elements that have class "count".

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.digitalocean.com')
soup = BeautifulSoup(resp.text, 'html.parser')
digits = [tag.getText() for tag in soup.find_all(class_="count")]
count = int(''.join(digits))

BeautifulSoup has excellent documentation for parsing HTML documents without having to bang your head (depending on how well the site you're scraping is laid out).

answered Dec 22 '25 by Kyle Kelley


