Scrapy - storing crawled pages as static files

Tags:

scrapy

Apologies if this is a scrapy noob question but I have spent ages looking for the answer to this:

I want to store the raw data from each and every URL I crawl in my local filesystem as a separate file (i.e. response.body -> /files/page123.html), ideally with the filename being a hash of the URL. This is so I can do further processing of the HTML (further parsing, indexing in Solr/Elasticsearch, etc.).

I've read the docs and I'm not sure if there's a built-in way of doing this. Since the pages are being downloaded by the framework anyway, it doesn't seem to make sense to have to write custom pipelines, etc.

asked by hammondos


1 Answer

As paul t said, the HttpCache middleware might work for you, but I'd advise writing your own custom pipeline.
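
(If you do want to try the cache route first, enabling it is just a couple of settings, roughly like the sketch below. Note that HttpCacheMiddleware stores responses in its own fingerprint-keyed directory layout under the project's .scrapy/ data dir, not as one nicely named file per URL, which is why a pipeline fits your use case better.)

# settings.py -- minimal sketch of enabling Scrapy's built-in HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'        # relative to the project's .scrapy/ directory
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire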

Scrapy has built-in ways of exporting data to files, but they're for JSON, XML and CSV, not raw HTML. Don't worry though, it's not too hard!

Provided your items.py looks something like:

from scrapy.item import Item, Field

class Listing(Item):
    url = Field()
    html = Field()

and you've been saving your scraped data to those items in your spider like so:

item['url'] = response.url
item['html'] = response.body
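
For context, a minimal spider that fills those fields could look something like this (the spider name and start URL below are placeholders, and it assumes the items.py above lives in a myproject package):

import scrapy
from myproject.items import Listing

class ListingSpider(scrapy.Spider):
    name = 'listings'                         # hypothetical spider name
    start_urls = ['http://example.com/']      # hypothetical start URL

    def parse(self, response):
        item = Listing()
        item['url'] = response.url            # the page's URL
        item['html'] = response.body          # the raw HTML bytes
        yield item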

your pipelines.py would just be:

import hashlib

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        # choose whatever hashing func works for you;
        # hashlib wants bytes, so encode the URL first (needed on Python 3)
        file_name = hashlib.sha224(item['url'].encode('utf-8')).hexdigest()
        # response.body is already bytes, so write in binary mode
        with open('files/%s.html' % file_name, 'wb') as f:
            f.write(item['html'])
        return item  # pass the item on to any later pipelines

Hope that helps. Oh, and don't forget to create a files/ directory in your project root and add this to your settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.HtmlFilePipeline': 300,
}

source: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

answered by NKelner


