Scrapy Crawl URLs in Order

My problem is relatively simple: I have one spider crawling multiple URLs, and I need it to return the data in the order the URLs appear in my code, which is posted below.

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from mlbodds.items import MlboddsItem

    class MLBoddsSpider(BaseSpider):
        name = "sbrforum.com"
        allowed_domains = ["sbrforum.com"]
        start_urls = [
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
            "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
        ]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
            items = []
            for site in sites:
                item = MlboddsItem()
                item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
                item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
                items.append(item)
            return items

The results come back in a seemingly random order: for example, the 29th first, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO in case that was the problem, but it made no difference.

asked Jul 04 '11 00:07 by Jeff


1 Answer

Scrapy's Request now has a priority attribute.

If you yield several Requests from one callback and want a particular one processed first, give it a higher priority:

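    from scrapy.http import Request

    # Method of your spider class.
    def parse(self, response):
        # Higher priority values are scheduled first (the default is 0).
        url = 'http://www.example.com/first'
        yield Request(url=url, callback=self.parse_data, priority=1)

        url = 'http://www.example.com/second'
        yield Request(url=url, callback=self.parse_data)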

Requests default to priority=0 and Scrapy schedules higher values first, so the request with priority=1 is processed before the other one.
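Applied to the question's start_urls, the same idea might look like the sketch below. The helper names prioritized and build_requests are hypothetical (they are not part of Scrapy), and the callback is assumed to be the spider's own parse method. Note that priority only controls the order in which requests are scheduled; with several requests in flight, responses can probably still finish out of order, so setting CONCURRENT_REQUESTS = 1 may also be needed for strictly sequential output.

```python
def prioritized(urls):
    """Pair each URL with a priority that decreases down the list,
    so the first URL gets the highest value (hypothetical helper)."""
    n = len(urls)
    return [(url, n - i) for i, url in enumerate(urls)]


def build_requests(urls, callback):
    """Turn the (url, priority) pairs into Scrapy Requests (hypothetical helper)."""
    # Imported lazily so prioritized() can be used without Scrapy installed.
    from scrapy.http import Request
    return [Request(url, callback=callback, priority=p)
            for url, p in prioritized(urls)]


if __name__ == "__main__":
    urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]
    for url, priority in prioritized(urls):
        print(priority, url)
```

In a spider, one way to use this would be to override start_requests and return build_requests(self.start_urls, self.parse) instead of relying on the default handling of start_urls.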

answered Oct 02 '22 19:10 by Sandeep Balagopal