I need to modify my request URL before the response is downloaded, but I'm not able to change it. Even after modifying the request URL with request.replace(url=new_url), process_response still prints the unmodified URL. Here's the code of the middleware:
def process_request(self, request, spider):
    original_url = request.url
    new_url = original_url + "hello%20world"
    print(request.url)  # This prints the original request URL
    request = request.replace(url=new_url)
    print(request.url)  # This prints the modified URL
def process_response(self, request, response, spider):
    print(request.url)   # This prints the original request URL
    print(response.url)  # This prints the original request URL
    return response
Can anyone please tell me what I'm missing here?
Since you are modifying the request object in process_request(), you need to return it. Request.replace() creates a new Request instead of mutating the existing one, so rebinding the local name request inside the method is not visible to the rest of Scrapy:
def process_request(self, request, spider):
    # Avoid an infinite loop: skip URLs that already contain the added part.
    if "hello%20world" in request.url:
        return None  # let the request continue through the pipeline unchanged
    new_url = request.url + "hello%20world"
    # Returning the new Request tells Scrapy to reschedule it with the modified URL.
    return request.replace(url=new_url)
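The underlying issue is that replace() returns a copy rather than modifying the request in place, so assigning the result to the local name request changes nothing for the caller. Here is a minimal sketch of that behaviour using a frozen dataclass as a stand-in for scrapy.http.Request (FakeRequest and both process_request variants are illustrative names, not part of Scrapy):

```python
from dataclasses import dataclass, replace as dc_replace

@dataclass(frozen=True)
class FakeRequest:
    """Stand-in for scrapy.http.Request: immutable, replace() returns a copy."""
    url: str

    def replace(self, **kwargs):
        return dc_replace(self, **kwargs)

def process_request_broken(request):
    # Rebinding the local name does NOT affect the caller's object,
    # and nothing is returned, so the modified copy is simply discarded.
    request = request.replace(url=request.url + "hello%20world")

def process_request_fixed(request):
    if "hello%20world" in request.url:
        return None  # already modified; let it through unchanged
    return request.replace(url=request.url + "hello%20world")

original = FakeRequest(url="http://example.com/?q=")
process_request_broken(original)
print(original.url)  # still http://example.com/?q=

new = process_request_fixed(original)
print(new.url)  # http://example.com/?q=hello%20world
```

The same logic applies in a real downloader middleware: return the new Request from process_request() and Scrapy will run the middleware chain again with it, which is why the guard clause against re-processing is needed.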