I am using Python and Scrapy for this question.
I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, ... Each B page contains a link to another page, C1, C2, C3, ..., which contains an image.
So, using Scrapy, the idea in pseudo-code is:
links = getlinks(A)
for link in links:
    B = getpage(link)
    C = getpage(B)
    image = getimage(C)
However, I am running into a problem when trying to parse more than one page in Scrapy. Here is my code:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('...')
    items = []
    for link in links:
        item = CustomItem()
        item['name'] = link.select('...')
        # TODO: Somehow I need to go two pages deep and extract an image.
        item['image'] = ....
How would I go about doing this?
(Note: My question is similar to "Using multiple spiders at in the project in Scrapy", but I am unsure how to "return" values from Scrapy's Request objects.)
In Scrapy, the parse method needs to return a new Request if you need to issue more requests (use yield, since Scrapy works well with generators). Inside that Request you can set a callback to the desired function (to make it recursive, just pass parse again). That is how you crawl into deeper pages.
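As a minimal illustration of that recursive pattern (a sketch only: it assumes the current Scrapy API, a made-up spider name and start URL, and a deliberately broad link selector), the spider below follows every link it finds and parses each result with the same parse method:

import scrapy

class RecursiveSpider(scrapy.Spider):
    name = 'recursive_example'             # hypothetical spider name
    start_urls = ['http://example.com/']   # hypothetical start page

    def parse(self, response):
        # Follow every link on the page and parse the result with this
        # same method, which makes the crawl recursive.
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)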
You can check this recursive crawler as an example.
Following your example, the change would be something like this:
from scrapy.http import Request

def parse(self, response):
    # Get the links to the B pages (kept as pseudo-code from the question)
    b_pages_links = getlinks(A)
    for link in b_pages_links:
        yield Request(link, callback=self.visit_b_page)

def visit_b_page(self, response):
    # Extract the link to the C page from the B page
    url_of_c_page = ...
    yield Request(url_of_c_page, callback=self.visit_c_page)

def visit_c_page(self, response):
    # Extract the image URL from the C page
    url_of_image = ...
    yield Request(url_of_image, callback=self.get_image)

def get_image(self, response):
    item = CustomItem()
    item['name'] = ...   # get image name
    item['image'] = ...  # get image data
    yield item
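Regarding the "return values from Request objects" part of your question: a common way to carry a value (such as the name extracted on the first page) down to the final callback is to attach it to the request's meta dict and read it back from response.meta later. Note that meta is not forwarded automatically, so each intermediate callback has to pass it along. A rough sketch of that idea, with placeholder selectors:

def parse(self, response):
    for link in response.xpath('...').extract():
        name = ...  # extract the name here, as in the original code
        # Attach the name to the request so later callbacks can read it
        yield Request(link, callback=self.visit_b_page, meta={'name': name})

def visit_b_page(self, response):
    url_of_c_page = ...
    # Forward the meta dict so the name survives the next hop
    yield Request(url_of_c_page, callback=self.visit_c_page, meta=response.meta)

# visit_c_page would forward response.meta in the same way

def get_image(self, response):
    item = CustomItem()
    item['name'] = response.meta['name']  # value carried over from parse
    item['image'] = response.body         # raw bytes of the image response
    yield item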
Also check the Scrapy documentation and these code snippets. They can help a lot :)