I am stuck with this problem in Scrapy.
I am trying to fill my item in the function parse_additional_info, and to do so I need to scrape a bunch of additional URLs in a second callback, parse_player:
for path in path_player:
    url = path.xpath('url_extractor').extract()[0]
    yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300)
My understanding is that these requests are executed asynchronously later on and only then fill in item, whereas yield item returns the item immediately, before it is completely filled.
I know it is not possible to wait for all the yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300) calls to complete, but how would you solve this problem, i.e. make sure the item is only yielded once all the information from the requests has been collected?
from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from datetime import datetime
from footscript.items import MatchResultItem
import re, json, string, datetime, uuid


class PreliminarySpider(Spider):
    name = "script"
    start_urls = [
        start_url1,
        start_url2,
        start_url3,
        start_url4,
        start_url5,
        start_url6,
        start_url7,
        start_url8,
        start_url9,
        start_url10,
    ]
    allowed_domains = ['domain.com']

    def parse(self, response):
        sel = Selector(response)
        matches = sel.xpath('match_selector')
        for match in matches:
            try:
                item = MatchResultItem()
                item['url'] = match.xpath('match_url_extractor').extract()[0]
            except Exception:
                print "Unable to get: %s" % match.extract()
                continue  # no URL was extracted, so there is nothing to request
            yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

    def parse_additional_info(self, response):
        item = response.request.meta['item']
        sel = Selector(response)
        try:
            item['roun'] = sel.xpath('round_extractor').extract()[0]
            item['stadium'] = sel.xpath('stadium_extractor').extract()[0]
            item['attendance'] = sel.xpath('attendance_extractor').extract()[0]
        except Exception:
            print "Attributes not found at: %s" % item['url']
        item['player'] = []
        path_player = sel.xpath('path_extractor')
        for path in path_player:
            player = path.xpath('player_extractor').extract()[0]
            player_id = path.xpath('player_d_extractor').extract()[0]
            country = path.xpath('country_extractor').extract()[0]
            item['player'].append([player_id, player, country])
            url = path.xpath('url_extractor').extract()[0]
            # One extra request per player; the same item is shared through meta.
            yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300)
        # except Exception:
        #     print "Unable to get players"
        # This runs before any parse_player callback has fired.
        yield item

    def parse_player(self, response):
        item = response.request.meta['item']
        sel = Selector(response)
        # The player id is taken to be the digits found in the URL.
        play_id = re.sub("[^0-9]", "", response.url)
        name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
        index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
        item['player'][index[0]][1] = name
        return item
EDIT: New code:
            yield Request(url, meta={'item': item}, callback=self.parse_player, errback=self.err_player)
        # except Exception:
        #     print "Unable to get players"
        yield item

    def parse_player(self, response):
        item = response.request.meta['item']
        sel = Selector(response)
        play_id = re.sub("[^0-9]", "", response.url)
        name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
        index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
        item['player'][index[0]][1] = name
        item['player'][index[0]].append("1")
        return item

    def err_player(self, response):
        print "****************"
        print "Player not found"
        print "****************"
        item = response.request.meta['item']
        play_id = re.sub("[^0-9]", "", response.url)
        index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
        item['player'][index[0]].append("1")
        return item
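The edited code appears to append "1" to a player row once its request has finished, whether it succeeded or failed. One way to build on that, offered as a rough sketch and not as an established Scrapy pattern, is to return the item only from the callback that flags the last row. The helper name mark_done and the assumption that an unfinished row has exactly three entries are illustrative and not part of the original code.

```python
def mark_done(item, play_id):
    """Flag the first unfinished row matching play_id.

    Returns True once every row has been flagged.
    Assumption: a row is [player_id, player, country] while pending and
    gains a trailing "1" when its request has finished.
    """
    for row in item['player']:
        if play_id in row[0] and len(row) == 3:
            row.append("1")
            break
    return all(len(row) == 4 for row in item['player'])


# A plain dict stands in for MatchResultItem to show the behaviour.
item = {'player': [['101', 'A', 'FR'], ['202', 'B', 'DE']]}
first = mark_done(item, '101')   # one row is still pending
second = mark_done(item, '202')  # the last row has now been flagged
```

In parse_player and err_player this would read "if mark_done(item, play_id): return item", and the unconditional yield item in parse_additional_info would go away. A match with no players would then have to be yielded directly, since no player callback would ever fire for it.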
Passing items across multiple callbacks is a very delicate practice. It can work in very simple cases, but you can run into all kinds of issues:
- A request fails and its item is lost (you can mitigate this with Request(..., errback=self.my_parse_err), but it is quite tedious to create 2 callbacks for each request).
- A request is dropped as a duplicate and its item is lost (you can mitigate this with Request(..., dont_filter=True) and by adding HTTPCACHE_ENABLED=True to settings.py).
The safe path, from both a development and a production perspective, is to create 1 type of item for each type of page, then combine the 2 related items in a post-processing step.
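As a rough illustration of that post-processing step, the sketch below joins match records and player records after the crawl has finished. It assumes both were exported as plain dicts; the field names player_ids and player_id and the function combine are hypothetical and not taken from the original spider.

```python
def combine(match_items, player_items):
    """Attach the full player records to each match, matching on player id."""
    players_by_id = dict((p['player_id'], p) for p in player_items)
    combined = []
    for match in match_items:
        merged = dict(match)  # copy, so the scraped record is left untouched
        # Players whose page was never scraped are simply left out.
        merged['players'] = [players_by_id[pid]
                             for pid in match['player_ids']
                             if pid in players_by_id]
        combined.append(merged)
    return combined


matches = [{'url': 'match-1', 'player_ids': ['101', '202', '303']}]
players = [{'player_id': '101', 'name': 'A'},
           {'player_id': '202', 'name': 'B'}]
result = combine(matches, players)
```

Because each page type yields its own item as soon as it is parsed, a failed or filtered player request only leaves one player out of the join and does not hold back the match item.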
Please also note that if you have duplicate URLs, you will probably end up with duplicate data in your items, which will also cause data normalisation issues in the database.
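A minimal sketch of dropping such duplicates before loading the data, assuming each scraped record is a dict carrying a unique key. The helper dedupe_by_key and the url field are illustrative names only.

```python
def dedupe_by_key(records, key):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value in seen:
            continue  # a record with this key was already kept
        seen.add(value)
        unique.append(record)
    return unique


scraped = [
    {'url': 'player-101', 'name': 'A'},
    {'url': 'player-202', 'name': 'B'},
    {'url': 'player-101', 'name': 'A'},
]
unique_players = dedupe_by_key(scraped, 'url')
```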