Author's note: You might think this post is lacking context or information; that is only because I don't know where to start. I'll gladly edit in additional details on request.
Running Scrapy, I see the following error among all the links I am scraping:
ERROR: Error downloading <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/downloader/__init__.py", line 75, in _deactivate
self.active.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
2016-01-19 15:57:20 [scrapy] INFO: Error while removing request from slot
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 140, in <lambda>
d.addBoth(lambda _: slot.remove_request(request))
File "/Library/Python/2.7/site-packages/scrapy/core/engine.py", line 38, in remove_request
self.inprogress.remove(request)
KeyError: <GET http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html>
When I run Scrapy on that single URL by itself, using:
scrapy shell http://www.fifa.com/fifa-tournaments/players-coaches/people=44630/index.html
no error occurs. I am scraping thousands of similar links without problems, but I see this issue on ~10 links. I am using Scrapy's default download timeout of 180 seconds.
I don't see anything wrong with these links in my web browser either.
The parsing is initiated by the request:
request = Request(url_nrd, meta={'item': item}, callback=self.parse_player, dont_filter=True)
which is handled by the following callbacks:
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        # doing stuff there
        request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
        yield request

def parse_more(self, response):
    # parsing more stuff here
    return item
Also: I didn't change the default settings for download retries in Scrapy (and I don't see any retries in my log files either).
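For reference, here is a minimal sketch of the relevant defaults as they would appear in settings.py (these are Scrapy's documented setting names; I have not changed any of them):

# settings.py -- Scrapy defaults relevant here (unchanged in my project)
DOWNLOAD_TIMEOUT = 180  # seconds the downloader waits before timing out a request
RETRY_ENABLED = True    # RetryMiddleware is enabled by default
RETRY_TIMES = 2         # a failed download is retried up to 2 more times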
Additional notes:
After my scraping completed, and since dont_filter=True means the same URLs are requested more than once, I can see that links which failed to download with the error above on one attempt did not fail on earlier or later attempts.
Possible answer:
I see that I am getting a KeyError on one of the requests, meaning that removing it from the slot failed (remove_request). Is it possible that this happens because I set dont_filter=True and make several requests to the same URL, and the key seems to be that URL? Could the request have been de-allocated by a previous, concurrent request to the same URL?
In that case, how can I get a unique key per request, not indexed on the URL?
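A quick check outside the spider (a minimal sketch; example.com stands in for the real URL) suggests the key is actually the Request object itself, not the URL: Scrapy's Request does not override __hash__, so two requests to the same URL are distinct set members, and the URL shown in the KeyError is just the request's repr:

from scrapy.http import Request

r1 = Request('http://www.example.com/', dont_filter=True)
r2 = Request('http://www.example.com/', dont_filter=True)

active = {r1}        # mimics the engine's in-progress set
print(r1 in active)  # True
print(r2 in active)  # False: same URL, but a different object
active.remove(r2)    # KeyError: <GET http://www.example.com/>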
EDIT
I think my code in parse_player was the problem. I don't know for sure, because I have edited my code since, but I recall seeing a bad indent on yield request.
def parse_player(self, response):
    if response.status == 404:
        # doing stuff here
        yield item
    else:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
            # Bad indent of yield request here!
            yield request
Let me know if you think that might have caused the issue.
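If that indent was indeed the bug, then whenever some_condition was false for an iteration, request would keep its value from a previous pass and the loop would yield the same Request object more than once; the engine would then try to remove that one object from its in-progress set twice, which is exactly this KeyError. A toy illustration (hypothetical paths and condition, collecting into a list instead of yielding):

from scrapy.http import Request

yielded = []
request = None
for path in ['a', 'b', 'c']:
    if path != 'b':  # stands in for some_condition; false for 'b'
        request = Request('http://www.example.com/%s' % path, dont_filter=True)
    # mis-indented: runs on every iteration, not only when a new request was built
    yielded.append(request)

print(yielded[0] is yielded[1])  # True: the 'a' request appears twice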
And what if you simply ignore the errors and only process successful responses?
def parse_player(self, response):
    if response.status == 200:
        paths = sel.xpath('some path extractor here')
        for path in paths:
            if (some_condition):
                # doing stuff there
                request = Request(url_new, meta={'item': item}, callback=self.parse_more, dont_filter=True)
                yield request