I am new to Python and Scrapy, and I have not used callback functions before. In the code below, the first request is executed and its response is passed to the callback function given as the second argument:
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
I am unable to understand the following things:

1. How is item populated? Does the request.meta line execute before the response.meta line in parse_page2?
2. Where is the item returned from parse_page2 going?
3. What is the need of the return request statement in parse_page1? I thought the extracted items need to be returned from here.

Read the docs:
For spiders, the scraping cycle goes through something like this:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.

In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.
Finally, the items returned from the spider will be typically persisted to a database (in some Item Pipeline) or written to a file using Feed exports.
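The default start_requests behavior described in the quote can be sketched in plain Python. This is a simplified model, not Scrapy's actual source; the Request namedtuple and MySpider class here are stand-ins for illustration:

```python
from collections import namedtuple

# Stand-in for scrapy.Request: just a URL and the callback to handle its response
Request = namedtuple("Request", ["url", "callback"])

class MySpider:
    """Simplified model of the default spider behavior described above."""
    start_urls = ["http://www.example.com/page1", "http://www.example.com/page2"]

    def start_requests(self):
        # Default behavior: one Request per start URL, with self.parse as callback
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Placeholder callback; a real spider would extract data here
        pass
```

Calling list(MySpider().start_requests()) yields one Request per start URL, each carrying self.parse as its callback, which is exactly the pairing the docs describe.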
Answers:

1. How is item populated? Does the request.meta line execute before the response.meta line in parse_page2?
Spiders are managed by the Scrapy engine. The engine first makes requests from the URLs specified in start_urls and passes them to a downloader. When downloading finishes, the callback specified in the request is called. If the callback returns another request, the same thing is repeated. If the callback returns an Item, the item is passed to a pipeline to save the scraped data.

So yes: the request.meta['item'] = item line in parse_page1 runs first, when the request is created; response.meta in parse_page2 is read later, after that request has been downloaded and its callback invoked.
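That request → download → callback → (request or item) cycle can be modeled with a toy engine loop. This is a hypothetical sketch, not Scrapy's real machinery: the Request and Response classes, the run_engine function, and the items list standing in for the pipeline are all assumptions for illustration.

```python
from collections import deque

class Request:
    """Stand-in for scrapy.Request: URL, callback, and a meta dict."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback
        self.meta = {}

class Response:
    """Stand-in for a downloaded response: exposes the request's URL and meta."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta

def run_engine(start_requests):
    """Toy engine loop: 'download' each request, call its callback,
    re-queue returned Requests, and collect returned items (plain dicts)."""
    queue = deque(start_requests)
    items = []
    while queue:
        request = queue.popleft()
        response = Response(request)         # pretend we downloaded the page
        result = request.callback(response)  # hand the response to the callback
        if isinstance(result, Request):
            queue.append(result)             # more scraping to do
        else:
            items.append(result)             # finished item goes to the "pipeline"
    return items

def parse_page1(response):
    item = {'main_url': response.url}
    request = Request("http://www.example.com/some_page.html",
                      callback=parse_page2)
    request.meta['item'] = item              # attach the half-built item
    return request

def parse_page2(response):
    item = response.meta['item']             # same dict, carried over via meta
    item['other_url'] = response.url
    return item
```

Running run_engine([Request("http://www.example.com/start.html", parse_page1)]) returns a single item containing both URLs, which shows that meta is written when the request is built and read only later, inside the second callback.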
2. Where is the item returned from parse_page2 going?

3. What is the need of the return request statement in parse_page1? I thought the extracted items need to be returned from here.
As stated in the docs, each callback (both parse_page1 and parse_page2) can return either a Request or an Item (or an iterable of them). parse_page1 returns a Request, not an Item, because additional info needs to be scraped from an additional URL. The second callback, parse_page2, returns an item, because all the info is scraped and ready to be passed to a pipeline.
If all the needed data is available on the first page, you can return the item directly from parse_page1 and avoid the extra HTTP request call.