I want to create a separate output file for every URL I have set in the start_urls of my spider, or somehow split the output files by start URL.
These are the start_urls of my spider:
start_urls = ['http://www.dmoz.org/Arts/', 'http://www.dmoz.org/Business/', 'http://www.dmoz.org/Computers/']
I want to create separate output files like:
Arts.xml
Business.xml
Computers.xml
I don't know exactly how to do this. I am thinking of achieving it by implementing something like the following in the spider_opened method of my item pipeline class:
import re
from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class CleanDataPipeline(object):
    def __init__(self):
        self.cnt = 0
        self.filename = ''

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # 'response' is not defined here -- this is exactly my problem
        referer_url = response.request.headers.get('referer', None)
        if referer_url in spider.start_urls:
            catname = re.search(r'/(.*)$', referer_url, re.I)
            self.filename = catname.group(1)
        file = open('output/' + str(self.cnt) + '_' + self.filename + '.xml', 'w+b')
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        #file.close()

    def process_item(self, item, spider):
        self.cnt = self.cnt + 1
        self.spider_closed(spider)
        self.spider_opened(spider)
        self.exporter.export_item(item)
        return item
Here I am trying to find the referer URL of every scraped item within the start_urls list. If the referer URL is found in start_urls, the file name is created from it. But the problem is how to access the response object inside the spider_opened() method. If I could access it there, I could create the file based on it.
Any help finding a way to do this? Thanks in advance!
I'd implement a more explicit approach (not tested):
Configure the list of possible categories in settings.py:
CATEGORIES = ['Arts', 'Business', 'Computers']
Define your start_urls based on that setting (this assumes the settings are importable in the spider module, e.g. via from scrapy.conf import settings in older Scrapy):

start_urls = ['http://www.dmoz.org/%s' % category for category in settings.CATEGORIES]
Add a category Field to the Item class.
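A minimal sketch of such an Item; the DmozItem class name and the other field names (title, link) are assumptions for illustration:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    category = Field()  # set in the spider's parse(), read by the pipeline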
In the spider's parse method, set the category field according to the current response.url, e.g.:

def parse(self, response):
    ...
    item['category'] = next(category for category in settings.CATEGORIES if category in response.url)
    ...
In the pipeline, open up exporters for all categories and choose which exporter to use based on item['category']:
def spider_opened(self, spider):
    ...
    self.exporters = {}
    self.files = {}  # keep the file handles around so they can be closed later
    for category in settings.CATEGORIES:
        file = open('output/%s.xml' % category, 'w+b')
        exporter = XmlItemExporter(file)
        exporter.start_exporting()
        self.files[category] = file
        self.exporters[category] = exporter

def spider_closed(self, spider):
    for exporter in self.exporters.itervalues():
        exporter.finish_exporting()
    for file in self.files.itervalues():
        file.close()

def process_item(self, item, spider):
    self.exporters[item['category']].export_item(item)
    return item
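For completeness, the pipeline also has to be enabled in settings.py; depending on the Scrapy version ITEM_PIPELINES is a list or a dict, and the module path myproject.pipelines here is an assumption:

ITEM_PIPELINES = {'myproject.pipelines.CleanDataPipeline': 300}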
You would probably need to tweak it a bit to make it work, but I hope you get the idea: store the category inside the item being processed, then choose the file to export to based on that category value.
Hope that helps.
As long as you don't store it in the item itself, you can't really know the starting URL. The following solution should work for you:
Redefine make_requests_from_url to send the starting URL with each Request you make. You can store it in the meta attribute of your Request, and pass this starting URL along with each following Request.
As soon as you decide to pass the element to the pipeline, fill in the starting URL for the item from response.meta['start_url'].
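An untested sketch of that approach; the spider and item names (DmozSpider, DmozItem), the link-following logic, and the start_url item field are assumptions for illustration:

import urlparse
from scrapy.http import Request
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = 'dmoz'
    start_urls = ['http://www.dmoz.org/Arts/', 'http://www.dmoz.org/Business/']

    def make_requests_from_url(self, url):
        # Attach the starting URL to the very first request.
        request = super(DmozSpider, self).make_requests_from_url(url)
        request.meta['start_url'] = url
        return request

    def parse(self, response):
        start_url = response.meta['start_url']
        for href in response.xpath('//a/@href').extract():
            # Pass the starting URL along with every follow-up request.
            yield Request(urlparse.urljoin(response.url, href),
                          callback=self.parse_item,
                          meta={'start_url': start_url})

    def parse_item(self, response):
        item = DmozItem()  # hypothetical Item with a start_url field
        item['start_url'] = response.meta['start_url']
        yield item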
Hope it helps. The following links may be helpful:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.make_requests_from_url
http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=meta#passing-additional-data-to-callback-functions