Scrapy download error and remove_request error

Author note: You might think that this post is lacking context or information; that is only because I don't know where to start. I'll gladly edit in additional information at your request.


Running Scrapy, I see the following error amongst the links I am scraping:

ERROR: Error downloading <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 75, in _deactivate
    self.active.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
2016-01-19 15:57:20 [scrapy] INFO: Error while removing request from slot
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 140, in <lambda>
    d.addBoth(lambda _: slot.remove_request(request))
  File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 38, in remove_request
    self.inprogress.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>

When I run the Scrapy shell on just that single URL:

scrapy shell http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html

no errors occur. I am scraping thousands of similar links without a problem, but I see this issue on ~10 of them. I am using Scrapy's default 180-second download timeout, and I don't see anything wrong with these links in my web browser either.

The parsing is initiated by the request:

  request = Request(url_nrd, meta={'item': item}, callback=self.parse_player, dont_filter=True)

Which is handled in the functions:

  def parse_player(self, response):
    if response.status == 404:
      # doing stuff here
      yield item
    else:
      # doing stuff there
      request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
      yield request

  def parse_more(self, response):
    # parsing more stuff here
    return item

Also: I didn't change the default settings for download retries in Scrapy (but I don't see any retries in my log files either).
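
For reference, retries are handled by Scrapy's RetryMiddleware and controlled by the RETRY_* settings. A minimal sketch of the relevant entries in settings.py; the values shown are what I believe the defaults to be, so verify them against your Scrapy version:

  # settings.py -- retry defaults (assumed values, check your Scrapy version)
  RETRY_ENABLED = True                          # RetryMiddleware is on by default
  RETRY_TIMES = 2                               # up to 2 extra attempts per failed download
  RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # response codes that trigger a retry

Retries cover connection failures and the codes above; the KeyError here is raised inside the engine after the download stage, which may explain why no retries show up in the log. Individual download failures can also be surfaced by passing an errback callable to Request.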

Additional notes: after my scraping completed, and since I use dont_filter=True, I can see that links which failed with the error above at some point did not fail in earlier or later requests for the same URL.

Possible answer: I see that I am getting a KeyError and that de-allocating the request failed (remove_request). Is it possible that this is because I set dont_filter=True and issue several requests for the same URL, so that requests are effectively keyed by that URL? That the request was already de-allocated by a previous, concurrent request for the same URL?

If so, how can I get a unique key per request rather than one indexed on the URL?
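
For context, the duplicate filter that dont_filter=True bypasses keys requests by a fingerprint derived from the method, URL and body, not by object identity. A minimal sketch, assuming the request_fingerprint helper from scrapy.utils.request:

  from scrapy.http import Request
  from scrapy.utils.request import request_fingerprint

  url = 'http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html'
  r1 = Request(url)
  r2 = Request(url)

  print(r1 is r2)                                            # False: two distinct objects
  print(request_fingerprint(r1) == request_fingerprint(r2))  # True: identical fingerprint

As far as I can tell, though, the engine's active/inprogress sets hold the Request objects themselves, so two separate Request objects for the same URL should not collide there; only yielding the very same object twice would.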


EDIT

I think my code in parse_player was the problem. I don't know for sure, because I have edited my code since, but I recall seeing a bad indent on yield request:

  def parse_player(self, response):
    if response.status == 404:
      # doing stuff here
      yield item
    else:
      paths = sel.xpath('some path extractor here')
      for path in paths:
        if (some_condition):
          # doing stuff there
          request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
        # Bad indent: yield sits outside the if, so the same request is yielded on every iteration!
        yield request

Let me know if you think that might have caused the issue.
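
If the indent really was wrong, the for loop would yield the same Request object once per remaining path. The engine and downloader track in-progress requests in plain Python sets, so scheduling the identical object several times and removing it several times would raise exactly this kind of KeyError. A minimal pure-Python sketch of that failure mode (not Scrapy's actual code):

  active = set()
  request = object()      # stands in for the single Request instance

  for _ in range(2):      # "yielded" twice because of the bad indent
    active.add(request)   # adding the same object twice is a no-op for a set

  active.remove(request)  # first removal succeeds
  active.remove(request)  # second removal raises KeyError, like the traceback above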

asked Jan 20 '16 by vrleboss

1 Answer

What if you simply ignore the errors?

  def parse_player(self, response):
    if response.status == 200:
      paths = sel.xpath('some path extractor here')
      for path in paths:
        if (some_condition):
          # doing stuff there
          request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
          yield request  # yield inside the condition, fixing the bad indent
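
One caveat with keying the logic on response.status: by default Scrapy only hands 2xx responses to callbacks, so for a 404 to ever reach parse_player the spider has to opt in. A sketch using the standard handle_httpstatus_list attribute (the spider name is hypothetical):

  import scrapy

  class PlayerSpider(scrapy.Spider):  # hypothetical spider name
    name = 'players'
    handle_httpstatus_list = [404]    # let 404 responses reach the callback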

answered Sep 28 '22 by Elinaldo Monteiro