So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
The results come back in an unpredictable order; for example, it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.
Scrapy's Request has a priority attribute now.
If you yield several Request objects from one callback and want a particular request processed first, you can set its priority:
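from scrapy import Request

def parse(self, response):
    # This request gets a higher priority than the default of 0.
    url = 'http://www.example.com/first'
    yield Request(url=url, callback=self.parse_data, priority=1)

    # No priority given, so this request keeps the default priority.
    url = 'http://www.example.com/second'
    yield Request(url=url, callback=self.parse_data)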
Scrapy will schedule the one with priority=1 first, because the default priority is 0 and requests with higher values are scheduled earlier.
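To apply the same idea to the three dated start URLs in the question, one option is to give each URL a priority that decreases with its position in the list. The following is only a minimal sketch: `ordered_requests` and its `make_request` parameter are hypothetical names introduced here for illustration, and `make_request` is assumed to be `scrapy.Request` (or any callable that accepts the same `url`, `callback`, and `priority` keyword arguments).

```python
def ordered_requests(urls, make_request, callback=None):
    """Yield one request per URL, giving earlier URLs a higher priority.

    `make_request` is assumed to be scrapy.Request or a compatible callable
    (hypothetical parameter, not part of Scrapy itself).
    """
    total = len(urls)
    for index, url in enumerate(urls):
        # The first URL gets priority `total`, the last gets 1, so the
        # scheduler should hand them out in list order.
        yield make_request(url=url, callback=callback, priority=total - index)


if __name__ == "__main__":
    # Stand-in factory so the sketch runs without Scrapy installed.
    def fake_request(**kwargs):
        return kwargs

    urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/",
    ]
    for request in ordered_requests(urls, fake_request):
        print(request["priority"], request["url"])
```

Inside a spider, this could be called from `start_requests`, for example `return ordered_requests(self.start_urls, Request, self.parse)`. Note that priority only controls the order in which requests are scheduled. With several downloads in flight, responses may still finish out of order, so limiting concurrency (for instance by setting `CONCURRENT_REQUESTS` to 1) may also be needed if strict ordering matters.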