I'm trying to write a very simple website crawler that lists URLs along with their referrer and status code, for the 200, 301, 302 and 404 HTTP status codes.
Scrapy works great for this: my script uses it to crawl the website and can list URLs with 200 and 404 status codes without problems.
The problem is: I can't find a way to have Scrapy follow redirects AND parse/output them. I can get one to work, but not both.
What I've tried so far:
setting meta={'dont_redirect': True}
setting REDIRECTS_ENABLED = False
adding 301 and 302 to handle_httpstatus_list
changing the settings specified in the redirect middleware docs
reading the redirect middleware code for insight
various combinations of all of the above
other random stuff
Here's the public repo if you want to take a look at the code.
If you want to parse 301 and 302 responses, and follow them at the same time, ask for 301 and 302 to be processed by your callback and mimic the behavior of RedirectMiddleware.
Let's illustrate with a simple spider to start with (not working as you intend yet):
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        self.logger.info("got response for %r" % response.url)
Right now, the spider asks for 2 pages, and the 2nd one redirects to http://example.com/:
$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)
The 302 is handled by RedirectMiddleware automatically and it does not get passed to your callback.
Let's configure the spider to handle 301s and 302s in the callback, using handle_httpstatus_list:
import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))
Let's run it:
$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)
Here, the callback sees the 302, but the redirection is no longer followed.
Let's do the same as RedirectMiddleware, but in the spider callback:
from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]

    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status >= 300 and response.status < 400:
            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            location = to_native_str(response.headers['location'].decode('latin1'))
            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)
            if response.status in (301, 307) or request.method == 'HEAD':
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected
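Before running it, two details of the copied logic can be sanity-checked without Scrapy at all: urljoin is there because Location headers may be relative, and only 301/307 responses (or redirects of HEAD requests) keep the original HTTP method. A minimal standalone sketch, using the Python 3 stdlib urljoin (function names are mine, not Scrapy's):

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    # Location headers may be relative; urljoin resolves them
    # against the URL of the request that got redirected.
    return urljoin(request_url, location)


def follow_method(status, method):
    # 301/307 responses, and redirects of HEAD requests, keep the
    # original HTTP method; other 3xx statuses are retried as GET.
    if status in (301, 307) or method == 'HEAD':
        return method
    return 'GET'


print(resolve_redirect('https://httpbin.org/redirect-to?url=x', '/get'))
# -> https://httpbin.org/get
print(follow_method(302, 'POST'))
# -> GET
```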
And run the spider again:
$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)
We got redirected to http://example.com/ and we also got the response through our callback.
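To tie this back to the original goal (listing URLs with referrer and status), the callback can also yield a record for each response before following the redirect. A minimal sketch of the record-building logic; the field names and the helper are my choice, not anything Scrapy defines:

```python
# status codes the question asks to track
TRACKED_STATUSES = {200, 301, 302, 404}


def crawl_record(url, status, referrer):
    # Build one (url, status, referrer) record per response,
    # dropping (returning None for) untracked status codes.
    if status not in TRACKED_STATUSES:
        return None
    return {'url': url, 'status': status, 'referrer': referrer}


print(crawl_record('http://example.com/', 200, None))
```

In the callback above, something like `record = crawl_record(response.url, response.status, response.request.headers.get('Referer'))` followed by `yield record` (when not None) would then emit one record per tracked response, whether it is a final page or a redirect hop.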