 

Pairing content on an external website with entries in a MySQL database

tl;dr: I'm looking for a way to find entries in our database that are missing information, fetch that information from a website, and add it to the database entry.


We have a media management program which uses a MySQL table to store its information. When employees download media (video files, images, audio files) and import it into the media manager, they are supposed to also copy the media's description (from the source website) into the description field in the Media Manager. However, this has not been done for thousands of files.

The file name (e.g. file123.mov) is unique, and the details page for that file can be accessed via a URL on the source website:

website.com/content/file123

The information we want to scrape from that page has an element ID which is always the same.

In my mind the process would be:

  1. Connect to the database and load the table
  2. Filter: "format" is "Still Image (JPEG)"
  3. Filter: "description" is NULL
  4. Get the first result
  5. Get "FILENAME" without the extension
  6. Load the URL: website.com/content/FILENAME
  7. Copy the contents of the "description" element (on the website)
  8. Paste the contents into the "description" field (SQL entry)
  9. Get the 2nd result
  10. Rinse and repeat until the last result is reached
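
For reference, steps 1 through 4 would boil down to a single query, something like this (the table name is a placeholder):

SELECT FILENAME FROM media_table
WHERE format = 'Still Image (JPEG)' AND description IS NULL
LIMIT 1;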

My questions are:

  1. Is there software that could perform such a task, or is this something that would need to be scripted?
  2. If scripted, what would be the best type of script? (E.g., could I achieve this using AppleScript, or would it need to be made in Java or PHP, etc.?)
OrangeBox asked May 23 '12

3 Answers

  1. Is there software that could perform such a task, or is this something that would need to be scripted?

    I'm not aware of anything that will do what you want out of the box (and even if there were, the configuration required wouldn't be much less work than the scripting involved in rolling your own solution).

  2. If scripted, what would be the best type of script? (E.g., could I achieve this using AppleScript, or would it need to be made in Java or PHP, etc.?)

    AppleScript can't connect to databases, so you will definitely need to throw something else into the mix. If the choice is between Java and PHP (and you're equally familiar with both), I'd definitely recommend PHP for this purpose, as there will be considerably less code involved.

    Your PHP script would look something like this:

    $BASEURL  = 'http://website.com/content/';
    
    // connect to the database
    $dbh = new PDO($DSN, $USERNAME, $PASSWORD);
    
    // query for files without descriptions
    $qry = $dbh->query("
      SELECT FILENAME FROM mytable
      WHERE  format = 'Still Image (JPEG)' AND description IS NULL
    ");
    
    // prepare an update statement
    $update = $dbh->prepare('
      UPDATE mytable SET description = :d WHERE FILENAME = :f
    ');
    
    $update->bindParam(':d', $DESCRIPTION);
    $update->bindParam(':f', $FILENAME);
    
    // loop over the files
    while ($FILENAME = $qry->fetchColumn()) {
      // construct URL
      $i = strrpos($FILENAME, '.');
      $url = $BASEURL . (($i === false) ? $FILENAME : substr($FILENAME, 0, $i));
    
      // fetch the document
      $doc = new DOMDocument();
      $doc->loadHTMLFile($url);
    
      // get the description (skip pages that lack the element)
      $node = $doc->getElementById('description');
      if ($node === null) continue;
      $DESCRIPTION = $node->nodeValue;
    
      // update the database
      $update->execute();
    }
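
    One caveat: DOMDocument::loadHTMLFile() raises a PHP warning for every parse error it hits, and real-world pages rarely validate. Wrapping the call with libxml's error handling (standard PHP, not specific to this script) keeps the output clean:

    // suppress parse warnings from non-validating HTML
    libxml_use_internal_errors(true);
    $doc->loadHTMLFile($url);
    libxml_clear_errors();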
    
eggyal answered Nov 12 '22


I too am not aware of any existing software packages that will do everything you're looking for. However, Python can connect to your database, make web requests easily, and handle dirty HTML. Assuming you already have Python installed, you'll need three packages:

  • MySQLdb for connecting to the database.
  • Requests for easily making HTTP requests.
  • BeautifulSoup for robust parsing of html.

You can install these packages with pip commands or Windows installers. Appropriate instructions are on each site. The whole process won't take more than 10 minutes.
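
Assuming pip is on your path, installing all three takes one command (MySQLdb is published on PyPI as MySQL-python):

pip install MySQL-python requests beautifulsoup4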

import MySQLdb as db
import os.path
import requests
from bs4 import BeautifulSoup

# Connect to the database. Fill in these fields as necessary.

con = db.connect(host='hostname', user='username', passwd='password',
                 db='dbname')

# Create and execute our SELECT sql statement.

select = con.cursor()
select.execute('SELECT filename FROM table_name \
                WHERE format = %s AND description IS NULL',
               ('Still Image (JPEG)',))

while True:
    # Fetch a row from the result of the SELECT statement.

    row = select.fetchone()
    if row is None: break

    # Use Python's built-in os.path.splitext to split the extension
    # and get the url_name.

    filename = row[0]
    url_name = os.path.splitext(filename)[0]
    url = 'http://www.website.com/content/' + url_name

    # Make the web request. You may want to rate-limit your requests
    # so that the website doesn't get angry. You can slow down the
    # rate by inserting a pause with:
    #               
    # import time   # You can put this at the top with other imports
    # time.sleep(1) # This will wait 1 second.

    response = requests.get(url)
    if response.status_code != 200:

        # Don't worry about skipped urls. Just re-run this script
        # on spurious or network-related errors.

        print 'Error accessing:', url, 'SKIPPING'
        continue

    # Parse the result. BeautifulSoup does a great job handling
    # mal-formed input.

    soup = BeautifulSoup(response.content)
    element = soup.find('div', {'id': 'description'})
    if element is None:
        print 'No description element at:', url, 'SKIPPING'
        continue
    description = element.get_text()

    # And finally, update the database with another query.

    update = con.cursor()
    update.execute('UPDATE table_name SET description = %s \
                    WHERE filename = %s',
                   (description, filename))

# Commit so the updates actually persist (MySQLdb does not autocommit).

con.commit()

I'll warn that I've made a good effort to make that code "look right" but I haven't actually tested it. You'll need to fill in the private details.

GrantJ answered Nov 11 '22


PHP is a good language for scraping. I have made a class that wraps PHP's cURL extension, available here:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

You'll probably need to use some of the options:

http://www.php.net/manual/en/function.curl-setopt.php
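
If you use cURL directly rather than through the class above, the options that usually matter for scraping look something like this (the user agent string is only an example):

$ch = curl_init('http://website.com/content/file123');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,          // follow redirects
    CURLOPT_USERAGENT      => 'Mozilla/5.0', // some sites reject blank user agents
));
$html = curl_exec($ch);
curl_close($ch);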

To scrape HTML, I usually use regular expressions, but here is a class I made that should be able to query HTML without issues:

http://pastebin.com/Jm9jKjAU

Usage is:

$h = new HTMLQuery();
$h->load( $string_containing_html );
$h->getElements( 'p', 'id' ); // Returns all p tags with an id attribute

The best option to scrape would be XPath, but it can't handle dirty HTML. You can use that to do things like:

//div[@class = 'itm']/p[last() and text() = 'Hello World']

The above selects the last p in div elements that have the innerHTML 'Hello World'.

You can use that in PHP with the built-in DOMXPath class.
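
As a minimal, untested sketch of that approach, reusing the 'description' ID from the question ($html is assumed to hold the fetched page source):

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate dirty HTML
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//*[@id = 'description']");
if ($nodes->length > 0) {
    echo $nodes->item(0)->textContent;
}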

Kohjah Breese answered Nov 11 '22