
Scrapy gives URLError: <urlopen error timed out>

I have a Scrapy program I am trying to get off the ground, but I can't get my code to execute; it always fails with the error below.

I can still visit the site using the scrapy shell command, so I know the URLs work.

Here is my code

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Malscraper.items import MalItem

class MalSpider(CrawlSpider):
  name = 'Mal'
  allowed_domains = ['www.website.net']
  start_urls = ['http://www.website.net/stuff.php?']
  rules = [
    Rule(LinkExtractor(
        allow=['//*[@id="content"]/div[2]/div[2]/div/span/a[1]']),
        callback='parse_item',
        follow=True)
  ]

  def parse_item(self, response):
    mal_list = response.xpath('//*[@id="content"]/div[2]/table/tr/td[2]/')

    for mal in mal_list:
      item = MalItem()
      item['name'] = mal.xpath('a[1]/strong/text()').extract_first()
      item['link'] = mal.xpath('a[1]/@href').extract_first()

      yield item

Edit: Here is the traceback.

Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>

Edit2:

With the scrapy shell command I am able to manipulate my responses, but I just noticed that the exact same error comes up again when visiting the site.

Edit3:

I am now finding that the error shows up on EVERY website I use the shell command with, but I am still able to manipulate the response.

Edit4: How do I verify that I am at least receiving a response from Scrapy when running the crawl command? I don't know whether it's my code or this error that is the reason my logs turn up empty.

Here is my settings.py

BOT_NAME = 'Malscraper'

SPIDER_MODULES = ['Malscraper.spiders']
NEWSPIDER_MODULE = 'Malscraper.spiders'
FEED_URI = 'logs/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'
asked Jun 25 '15 by grasshopper


3 Answers

There's an open scrapy issue for this problem: https://github.com/scrapy/scrapy/issues/1054

Although it seems to be just a warning on other platforms.

You can disable the S3DownloadHandler (which is causing this error) by adding this to your Scrapy settings:

DOWNLOAD_HANDLERS = {
  's3': None,
}
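
For reference, a minimal sketch of how this could sit alongside the settings.py shown in the question (the other values are just the ones the asker already has; nothing else needs to change):

# Malscraper/settings.py -- asker's settings plus the S3 handler disabled
BOT_NAME = 'Malscraper'

SPIDER_MODULES = ['Malscraper.spiders']
NEWSPIDER_MODULE = 'Malscraper.spiders'

FEED_URI = 'logs/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'

# Skip the s3:// scheme entirely so boto is never asked for credentials.
DOWNLOAD_HANDLERS = {
  's3': None,
}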
answered by José Ricardo


You can also remove boto from the optional features by adding:

from scrapy import optional_features
optional_features.remove('boto')

as suggested in this issue
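
A minimal sketch of where that snippet can live, assuming a project laid out like the asker's Malscraper and an older Scrapy release that still exposes scrapy.optional_features; putting it at the top of settings.py means it runs before the download handlers are set up:

# Malscraper/settings.py -- sketch for older Scrapy versions that expose optional_features
from scrapy import optional_features

# Pretend boto is unavailable so the S3 handler (and its credential probe) is skipped.
# Note: optional_features is a plain set, so this raises KeyError if boto is not installed.
optional_features.remove('boto')

# ... the asker's existing settings (BOT_NAME, SPIDER_MODULES, ...) stay as they are.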

answered by guilhermerama


This is very annoying. What is happening is that you have null credentials, and boto decides to populate them for you from a metadata server (if one exists) using _populate_keys_from_metadata_server(). See here and here. If you aren't running on an EC2 instance, or on something else that runs a metadata server (listening on the auto-magic IP 169.254.169.254), the attempt times out. This used to be OK and quiet, since Scrapy handles the exception here, but unfortunately boto started logging it here, hence the annoying message.

Apart from disabling s3 as said before, which looks a bit scary, you can achieve similar results by just setting the credentials to an empty string:

AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
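
To see where the timeout itself comes from, here is a small illustration (not part of the fix), assuming Python 2 to match the traceback above: outside EC2 nothing answers on 169.254.169.254, so boto's probe of the metadata server just waits until the timeout fires, and that is the URLError you see.

# Python 2 sketch: roughly what boto does when it probes the EC2 metadata server.
import socket
import urllib2

try:
  # 169.254.169.254 only answers inside EC2; elsewhere the connection times out
  # (or is refused), which urllib2 surfaces as URLError.
  urllib2.urlopen('http://169.254.169.254/latest/meta-data/', timeout=1.0)
except (urllib2.URLError, socket.timeout) as e:
  print e  # e.g. <urlopen error timed out>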
answered by neverlastn