I want to store all the data (text, hrefs, images) related to a specific website in a single folder. To do that, I need to pass the path of that folder to all the different parsing functions. So I want to pass this path as extra kwargs in scrapy.Request()
like this:
yield scrapy.Request(url=url,dont_filter=True, callback=self.parse,errback = self.errback_function,kwargs={'path': '/path/to_folder'})
But it gives the error TypeError: __init__() got an unexpected keyword argument 'kwargs'
How can I pass that path to next function?
The spider receives arguments in its constructor. By default, Scrapy sets all of the arguments as spider attributes, so you can skip the __init__ method completely. Use getattr with a default value when reading those attributes, so your code does not break when an argument is not supplied.
Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
... log(cook1)
self.log("end cookie2")
return Request("http://something.net/some/sa/" + response.headers.getlist('Location')[0], cookies={cook1[0]: cook1[1]}, callback=self.check_login_response)
...
For anyone who may need it:
You can pass extra arguments by using the meta argument, and read them back in the callback through response.meta, like this:
yield scrapy.Request(url=url,dont_filter=True,
callback=self.parse,errback = self.errback_function, meta={'filepath': filepath})
UPDATE:
Request.cb_kwargs was introduced in version 1.7. Prior to that, using Request.meta was recommended for passing information around callbacks. After 1.7, Request.cb_kwargs became the preferred way for handling user information, leaving Request.meta for communication with components like middlewares and extensions.
So for versions >= 1.7, the following would work:
request = scrapy.Request('http://www.example.com/index.html', callback=self.parse_page2, cb_kwargs=dict(main_url=response.url))
You can refer to this documentation: https://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions