How to change request url before making request in scrapy?

I need to modify my request url before a response is downloaded. But I'm not able to change it. Even after modifying the request url using request.replace(url=new_url), the process_response prints the non-modified url. Here's the code of the middleware:

def process_request(self, request, spider):
    original_url = request.url
    new_url = original_url + "hello%20world"
    print request.url            # This prints the original request url
    request = request.replace(url=new_url)
    print request.url            # This prints the modified url

def process_response(self, request, response, spider):
    print request.url            # This prints the original request url
    print response.url           # This prints the original request url
    return response

Can anyone please tell me what I'm missing here?

Rahul asked Dec 23 '15
People also ask

How do I make a Scrapy request?

Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.

How do I pass parameters in Scrapy request?

It is an old topic, but for anyone who needs it: to pass an extra parameter you must use cb_kwargs, then accept the parameter in the parse method. You can refer to this part of the documentation.

How do you set a header in Scrapy?

You need to set the user agent, which Scrapy allows you to do directly:

import scrapy

class QuotesSpider(scrapy.Spider):
    # ...
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0'

What is a HTTP request in Scrapy?

class scrapy.http.Request(*args, **kwargs)

A Request object represents an HTTP request, which is usually generated in the Spider and executed by the Downloader, thus generating a Response.

How to crawl websites with scrapy?

Scrapy can crawl websites using Request and Response objects. A Request object passes through the system: the spider issues it, the downloader executes it, and control returns when a Response object comes back. The Request object is an HTTP request that generates a Response; its first argument is a string that specifies the URL to request.

Where does Scrapy Spider store the URL response details?

When you start a Scrapy spider crawling, it stores the response details of each URL the spider requested inside a Response object. The good part about this object is that it remains available inside the parse method of the spider class.

How to change the body of a Scrapy request?

To change the body of a Request, use replace(). Request.meta is a dict that contains arbitrary metadata for the request. This dict is empty for new Requests and is usually populated by different Scrapy components (extensions, middlewares, etc.), so the data contained in this dict depends on the extensions you have enabled.


1 Answer

Since you are modifying the request object in process_request(), you need to return it:

def process_request(self, request, spider):
    # avoid an infinite loop: returning a Request re-enters the middleware,
    # so skip URLs that already contain the added part
    if "hello%20world" in request.url:
        return None

    new_url = request.url + "hello%20world"
    request = request.replace(url=new_url)
    return request
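Note that the middleware only runs if it is enabled in the project settings. A sketch, assuming the middleware lives in myproject/middlewares.py under a class named UrlRewriteMiddleware (both names are placeholders, not from the original question):

```python
# settings.py -- register the custom downloader middleware
# "myproject.middlewares.UrlRewriteMiddleware" is a placeholder path
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.UrlRewriteMiddleware": 543,
}
```

The number (543 here) is the middleware's position in the chain; lower values run closer to the engine, higher values closer to the downloader.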
alecxe answered Nov 15 '22