Maintain updated file cache of web pages in Python?

I'm writing an application that needs persistent local access to large files fetched via http. I want the files saved in a local directory (a partial mirror of some sort), so that subsequent executions of the application simply notice that the URLs have already been mirrored locally, and so that other programs can use them.

Ideally, it would also preserve timestamp or etag information and be able to make a quick http request with an If-Modified-Since or If-None-Match header to check for a new version but avoid a full download unless a file has been updated. However, since these web pages rarely change, I can probably live with bugs from stale copies, and just find other ways to delete files from the cache when appropriate.
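For reference, the conditional request I have in mind is easy enough to issue by hand with requests; a rough sketch (the example.com URL, file path and header values are just placeholders for whatever metadata was saved alongside the mirrored file):

import requests

# Hypothetical values remembered from a previous download of this URL.
previous_etag = '"abc123"'
previous_last_modified = 'Tue, 25 Nov 2014 02:00:00 GMT'

res = requests.get(
    'http://example.com/big-file',
    headers={
        'If-None-Match': previous_etag,
        'If-Modified-Since': previous_last_modified,
    },
)

if res.status_code == 304:
    pass  # Not modified: keep using the mirrored copy.
else:
    with open('mirror/big-file', 'wb') as handle:
        handle.write(res.content)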

Looking around, I see that urllib.request.urlretrieve can save cached copies, but it looks like it can't handle my If-Modified-Since cache-updating goal.
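The basic call is roughly this (URL and filename are placeholders):

from urllib.request import urlretrieve

# Saves the resource to the given local filename; no If-Modified-Since support.
urlretrieve('http://example.com/big-file', 'mirror/big-file')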

The requests module seems like the latest and greatest, but it doesn't seem to handle this case on its own. There is a CacheControl add-on module which supports my cache-updating goal since it does full HTTP caching. But it seems that it doesn't store the fetched files in a way that is directly usable by other (non-Python) programs, since its FileCache stores the resources as pickled data. And the discussion in the Stack Overflow question "can python-requests fetch url directly to file handle on disk like curl?" suggests that saving to a local file can be done with some extra code, but that doesn't seem to mesh well with the CacheControl module.
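For context, wiring up CacheControl with its FileCache looks roughly like this (the cache directory name is arbitrary); the catch is that what lands on disk is serialized response data rather than plain files other programs could read:

import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

# Responses get cached under .web_cache, but as serialized blobs keyed by URL,
# not as ordinary files named after the resource.
sess = CacheControl(requests.Session(), cache=FileCache('.web_cache'))
res = sess.get('http://example.com/big-file')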

So is there a web fetching library that does what I want? That can essentially maintain a mirror of files that have been fetched in the past (and tell me what the filenames are), without my having to manage all that explicitly?

asked Nov 25 '14 by nealmcb

2 Answers

I had the same requirements and found requests-cache. It was very easy to add because it extends requests. You can either have the cache live in memory and disappear when your script ends, or make it persistent using sqlite, mongodb, or redis. Here are the two lines I wrote, and they worked as advertised:

import requests, requests_cache
requests_cache.install_cache('scraper_cache', backend='sqlite', expire_after=3600)
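
After that, ordinary requests calls go through the cache transparently; for example (the from_cache flag is something requests-cache adds to responses):

res = requests.get('http://www.google.com/humans.txt')  # fetched and stored in scraper_cache.sqlite
res = requests.get('http://www.google.com/humans.txt')  # served from the cache until expire_after
print(res.from_cache)  # True on the second call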
answered Sep 25 '22 by Xavier Ashe


I don't think there is a library that does this, but it isn't too hard to implement. Here are some functions that I'm using with Requests that might help you out:

import os.path as op
from urllib.parse import urlsplit

import requests

CACHE = 'path/to/cache'

def _file_from_url(url):
    return op.basename(urlsplit(url).path)


def is_cached(*args, **kwargs):
    url = kwargs.get('url', args[0] if args else None)
    path = op.join(CACHE, _file_from_url(url))

    if not op.isfile(path):
        return False

    res = requests.head(*args, **kwargs)

    if not res.ok:
        return False

    # Check if the cache is stale. For me, checking content-length fit my use case;
    # you could use the modification date or ETag here instead:

    if 'content-length' not in res.headers:
        return False

    return op.getsize(path) == int(res.headers['content-length'])


def update_cache(*args, **kwargs):
    url = kwargs.get('url', args[0] if args else None)
    path = op.join(CACHE, _file_from_url(url))

    res = requests.get(*args, **kwargs)

    if res.ok:
        with open(path, 'wb') as handle:
            for block in res.iter_content(1024):
                if block:
                    handle.write(block)
                    handle.flush()

Usage:

if not is_cached('http://www.google.com/humans.txt'):
    update_cache('http://www.google.com/humans.txt')

# Do something with cache/humans.txt
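
To cover the "tell me what the filenames are" part of the question, a small hypothetical helper along the same lines can expose the local path:

def cached_path(url):
    # Hypothetical helper: the file update_cache() writes for this URL.
    return op.join(CACHE, _file_from_url(url))

with open(cached_path('http://www.google.com/humans.txt')) as handle:
    print(handle.read())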
answered Sep 21 '22 by nathancahill