
Parallel request in Scrapy

I am stuck with this problem in Scrapy: I am trying to fill my item in the function parse_additional_info, and to do so I need to scrape a number of additional URLs in a second callback, parse_player:

for path in path_player:
  url = path.xpath('url_extractor').extract()[0]
  yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300)

My understanding is that these requests are executed asynchronously later on and fill in item, but yield item returns it immediately, only partially filled. I know it is not possible to wait for all the yield Request(url, meta={'item': item}, callback=self.parse_player, priority=300) calls to complete, but how would you solve this problem, i.e. make sure the item is yielded only once all the information from the requests has been collected?

from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from datetime import datetime
from footscript.items import MatchResultItem
import re, json, string, datetime, uuid

class PreliminarySpider(Spider):
  name = "script"
  start_urls = [
    start_url1,
    start_url2,
    start_url3,
    start_url4,
    start_url5,
    start_url6,
    start_url7,
    start_url8,
    start_url9,
    start_url10,
  ]
  allowed_domains = ['domain.com']

  def parse(self, response):
    sel = Selector(response)
    matches = sel.xpath('match_selector')
    for match in matches:
      try:
        item = MatchResultItem()
        item['url'] = match.xpath('match_url_extractor').extract()[0]
      except Exception:
        print("Unable to get: %s" % match.extract())
        continue  # no URL for this match, so there is nothing to request
      yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

  def parse_additional_info(self, response):
    item = response.request.meta['item']
    sel = Selector(response)

    try:
      item['roun'] = sel.xpath('round_extractor').extract()[0]
      item['stadium'] = sel.xpath('stadium_extractor').extract()[0]
      item['attendance'] = sel.xpath('attendance_extractor').extract()[0]
    except Exception:
      print("Attributes not found at: %s" % item['url'])

    item['player'] = []
    path_player = sel.xpath('path_extractor')
    for path in path_player:
      player = path.xpath('player_extractor').extract()[0]
      player_id = path.xpath('player_d_extractor').extract()[0]
      country = path.xpath('country_extractor').extract()[0]
      item['player'].append([player_id, player, country])
      url = path.xpath('url_extractor').extract()[0]
      yield Request(url,meta = {'item' : item}, callback= self.parse_player, priority = 300)
   # except Exception:
   #   print "Unable to get players"
    yield item

  def parse_player(self, response):
    item = response.request.meta['item']
    sel = Selector(response)
    play_id = re.sub("[^0-9]", "",response.url)
    name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]][1]=name
    return item

EDIT: New code:

      yield Request(url, meta={'item': item}, callback=self.parse_player, errback=self.err_player)
   # except Exception:
   #   print "Unable to get players"
    yield item

  def parse_player(self, response):
    item = response.request.meta['item']
    sel = Selector(response)
    play_id = re.sub("[^0-9]", "", response.url)
    name = sel.xpath('//div[@class="fdh-wrap contentheader"]/h1/text()').extract()[0].encode('utf-8').rstrip()
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]][1] = name
    item['player'][index[0]].append("1")
    return item

  def err_player(self, failure):
    # Scrapy calls an errback with a Failure, not a Response;
    # the original request is available as failure.request.
    print("****************")
    print("Player not found")
    print("****************")
    item = failure.request.meta['item']
    play_id = re.sub("[^0-9]", "", failure.request.url)
    index = [i for i, row in enumerate(item['player']) if play_id in row[0]]
    item['player'][index[0]].append("1")
    return item

vrleboss asked Jun 11 '26 18:06


1 Answer

Passing items across multiple callbacks is a very delicate practice. It can work in very simple cases, but you can run into all kinds of issues:

  • A request fails (you can handle this with Request(..., errback=self.my_parse_err), but it is quite tedious to write two callbacks for each request).
  • The second-level requests contain duplicate URLs, which Scrapy's duplicate filter drops (you can work around this with Request(..., dont_filter=True) and by adding HTTPCACHE_ENABLED = True to settings.py).
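
To make the bookkeeping behind these issues concrete, here is a minimal sketch in plain Python (not Scrapy API) of the counting a spider would need in order to release an item only after all of its sub-requests have reported back. PendingItem and done are hypothetical names chosen for illustration, and the item layout only mirrors the question's item['player'] rows.

```python
class PendingItem(object):
    """Hypothetical helper: holds an item until every sub-request has reported."""

    def __init__(self, item, expected):
        self.item = item
        self.remaining = expected  # number of sub-requests still outstanding

    def done(self):
        """Record one finished sub-request, whether it succeeded or failed.

        Returns the item when nothing is outstanding any more, else None.
        """
        self.remaining -= 1
        return self.item if self.remaining == 0 else None


# Simulate a match page that spawned two player requests.
item = {'url': 'match-1', 'player': [['10', '', 'FR'], ['11', '', 'DE']]}
pending = PendingItem(item, expected=2)

first = pending.done()   # first player page parsed: item is not ready yet
second = pending.done()  # second player page failed: item is released anyway
```

In a spider, both the callback and the errback would have to call done(). A request dropped by the duplicate filter triggers neither, so the count would never reach zero, which is why dont_filter=True matters in this pattern.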

The safe path, from both a development and a production perspective, is to create one type of item for each type of page, then combine the two related items in a post-processing step.
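
As a rough sketch of that post-processing step, assume the spider exported match items and player items separately and that both are available as plain dicts. The function name merge_players and the player item keys player_id and name are assumptions made for this example; the [player_id, name, country] row layout follows the question's code.

```python
def merge_players(matches, players):
    """Return copies of the match items with player names filled in.

    `players` holds one dict per scraped player page; rows whose
    player_id has no scraped page keep the name found on the match page.
    """
    names = {p['player_id']: p['name'] for p in players}
    merged = []
    for match in matches:
        rows = [[pid, names.get(pid, name), country]
                for pid, name, country in match['player']]
        # dict(match, player=rows) copies the match and replaces one key,
        # so the original match item is left untouched.
        merged.append(dict(match, player=rows))
    return merged


matches = [{'url': 'm1',
            'player': [['10', 'J. Doe', 'FR'], ['11', 'A. Roe', 'DE']]}]
players = [{'player_id': '10', 'name': 'John Doe'}]

result = merge_players(matches, players)
```

Because each page type yields its own item, a failed or filtered player request only leaves one name unrefined instead of blocking the whole match item.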

Please also note that if you have duplicate URLs, you will probably end up with duplicate data in your items, which will also cause data normalisation issues in the database.
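
One possible clean-up before loading into the database, sketched under the assumption that scraped records are plain dicts sharing a unique key (the key name player_id and the helper name dedupe are hypothetical), is to keep only the first record seen for each key:

```python
def dedupe(records, key):
    """Keep the first record for each distinct value of `key`, in order."""
    seen = set()
    unique = []
    for record in records:
        value = record[key]
        if value not in seen:
            seen.add(value)
            unique.append(record)
    return unique


scraped = [
    {'player_id': '10', 'name': 'John Doe'},
    {'player_id': '11', 'name': 'Ann Roe'},
    {'player_id': '10', 'name': 'John Doe'},  # same player page scraped twice
]

unique_players = dedupe(scraped, 'player_id')
```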

Frederic Bazin answered Jun 14 '26 09:06