So I have a Scrapy program I am trying to get off the ground, but I can't get my code to run; it always comes back with the error below. I can still visit the site using the scrapy shell command, so I know the URLs and everything work.
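For example, something along these lines works fine from the shell (using the placeholder URL from my code below):
scrapy shell "http://www.website.net/stuff.php?"
>>> response.xpath('//*[@id="content"]')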
Here is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Malscraper.items import MalItem

class MalSpider(CrawlSpider):
    name = 'Mal'
    allowed_domains = ['www.website.net']
    start_urls = ['http://www.website.net/stuff.php?']
    rules = [
        Rule(LinkExtractor(
            allow=['//*[@id="content"]/div[2]/div[2]/div/span/a[1]']),
            callback='parse_item',
            follow=True)
    ]

    def parse_item(self, response):
        mal_list = response.xpath('//*[@id="content"]/div[2]/table/tr/td[2]/')
        for mal in mal_list:
            item = MalItem()
            item['name'] = mal.xpath('a[1]/strong/text()').extract_first()
            item['link'] = mal.xpath('a[1]/@href').extract_first()
            yield item
Edit: Here is the trace.
Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
Edit2: With the scrapy shell command I am able to manipulate my responses, but I just noticed that the same exact error comes up again when visiting the site.
Edit3: I am now finding that the error shows up on EVERY website I use the shell command with, but I am still able to manipulate the response.
Edit4:
So how do I verify I am atleast receiving a response from Scrapy when running the crawl command
?
Now I don't know if its my code that is the reason my logs turns up empty or the error ?
Here is my settings.py
BOT_NAME = 'Malscraper'
SPIDER_MODULES = ['Malscraper.spiders']
NEWSPIDER_MODULE = 'Malscraper.spiders'
FEED_URI = 'logs/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'
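For reference, I launch the crawl with something like:
scrapy crawl Mal
so with these settings the feed should end up under logs/Mal/<timestamp>.csv (the %(name)s and %(time)s placeholders expand to the spider name and a timestamp).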
There's an open scrapy issue for this problem: https://github.com/scrapy/scrapy/issues/1054
Although it seems to be just a warning on other platforms.
You can disable the S3DownloadHandler (which is causing this error) by adding this to your Scrapy settings:
DOWNLOAD_HANDLERS = {
    's3': None,
}
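In this project that would mean appending the override to the existing Malscraper/settings.py, roughly like this (a sketch that simply reuses the settings from the question):
BOT_NAME = 'Malscraper'
SPIDER_MODULES = ['Malscraper.spiders']
NEWSPIDER_MODULE = 'Malscraper.spiders'

FEED_URI = 'logs/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'

# Disable the built-in S3 download handler so Scrapy never needs boto for it.
DOWNLOAD_HANDLERS = {
    's3': None,
}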
You can also remove boto from the optional packages by adding:
from scrapy import optional_features
optional_features.remove('boto')
as suggested in this issue
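If you go that route, those two lines need to run before the crawler starts; one option is the top of settings.py (the exact placement is an assumption on my part, the linked issue does not prescribe one):
# Top of Malscraper/settings.py (assumed placement)
from scrapy import optional_features
optional_features.remove('boto')  # Scrapy now behaves as if boto were not installed

BOT_NAME = 'Malscraper'
# ... rest of the settings unchanged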
This is very annoying. What is happening is that you have null credentials, and boto decides to populate them for you from a metadata server (if one exists) using _populate_keys_from_metadata_server(). See here and here.
If you are not running on an EC2 instance, or on something else that runs a metadata server (listening on the auto-magic IP 169.254.169.254), the attempt times out. That used to be fine and quiet, since Scrapy handles the exception here, but unfortunately boto started logging it here, hence the annoying message. Apart from disabling s3 as described above... which looks a bit scary, you can achieve a similar result by simply setting the credentials to an empty string:
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""
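Those two settings go in the same settings.py as the rest. For the curious, the timeout itself is easy to reproduce outside Scrapy on any machine that is not on EC2, using only the standard library the traceback already goes through (purely illustrative):
import urllib2

try:
    # 169.254.169.254 is the EC2 metadata address boto queries for credentials.
    urllib2.urlopen('http://169.254.169.254/latest/meta-data/', timeout=2)
except Exception as err:
    # Off EC2 this typically ends in URLError: <urlopen error timed out>, as in the traceback above.
    print err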