Scrapy Crawl URLs in Order

Question

My problem is relatively simple. I have one spider crawling multiple pages, and I need it to return the data in the order the URLs appear in my code, which is posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()  # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items

The results are returned in a seemingly random order; for example, the spider returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, in case that was the problem, but that didn't change anything.

Sandeep Balagopal · Accepted Answer

Scrapy's Request now has a priority attribute.

If a callback yields several Requests and you want a particular one processed first, you can set its priority:

from scrapy import Request

def parse(self, response):
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    # No priority given, so this request keeps the default priority of 0.
    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)

Scrapy will schedule the one with priority=1 first: requests with a higher priority value are scheduled earlier, and the default priority is 0.
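Applied to the question, the same idea could look like the sketch below: each start URL gets a priority that decreases down the list, so the first URL is scheduled first. This is only an illustration, not the asker's or the answerer's code. The helper prioritized, the spider name ordered_odds, and the CONCURRENT_REQUESTS setting of 1 are assumptions added here. The Scrapy import is guarded only so that the helper can be tried on its own without Scrapy installed.

```python
# The three start URLs from the question, in the order they should be crawled.
START_URLS = [
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
    "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
]


def prioritized(urls):
    """Return (url, priority) pairs; earlier URLs get larger priorities.

    Hypothetical helper, not part of Scrapy.
    """
    return [(url, len(urls) - i) for i, url in enumerate(urls)]


try:
    import scrapy
except ImportError:
    scrapy = None  # the helper above still works without Scrapy installed

if scrapy is not None:

    class OrderedOddsSpider(scrapy.Spider):
        name = "ordered_odds"  # hypothetical spider name

        # Assumption: allowing one request at a time keeps a slow response
        # from being overtaken by a later, faster one.
        custom_settings = {"CONCURRENT_REQUESTS": 1}

        def start_requests(self):
            # Higher priority values are scheduled earlier.
            for url, priority in prioritized(START_URLS):
                yield scrapy.Request(url, callback=self.parse, priority=priority)

        def parse(self, response):
            # Placeholder callback: record which page was handled.
            yield {"url": response.url}
```

Note that priority only influences the order in which the scheduler hands out queued requests. With several requests in flight at once, responses can still come back in a different order, which is why the sketch also limits concurrency.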

Tags: python, asynchronous, hashmap, sorting, scrapy

Jeff
