 

How do I use Scrapy to crawl within pages?

I am using Python and Scrapy for this question.

I am attempting to crawl webpage A, which contains a list of links to webpages B1, B2, B3, ... Each B page contains a link to another page, C1, C2, C3, ..., which contains an image.

So, using Scrapy, the idea in pseudo-code is:

links = getlinks(A)
for link in links:
    B = getpage(link)
    C = getpage(B)
    image = getimage(C)

However, I am running into a problem when trying to parse more than one page in Scrapy. Here is my code:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('...')

    items = []
    for link in links:
        item = CustomItem()
        item['name'] = link.select('...')
        # TODO: Somehow I need to go two pages deep and extract an image.
        item['image'] = ....

How would I go about doing this?

(Note: My question is similar to "Using multiple spiders at in the project in Scrapy", but I am unsure how to "return" values from Scrapy's Request objects.)

asked Jun 10 '13 by sdasdadas

1 Answer

In Scrapy, the parse method needs to return a new Request when you need to issue more requests (use yield, since Scrapy works well with generators). Inside that request you set a callback to the desired function (to make it recursive, just pass parse again). That's the way to crawl into pages.

You can check this recursive crawler as an example.

Following your example, the change would be something like this:

from scrapy.http import Request

def parse(self, response):
    b_page_links = ...  # extract the links to the B pages from page A
    for link in b_page_links:
        yield Request(link, callback=self.visit_b_page)

def visit_b_page(self, response):
    url_of_c_page = ...
    yield Request(url_of_c_page, callback=self.visit_c_page)

def visit_c_page(self, response):
    url_of_image = ...
    yield Request(url_of_image, callback=self.get_image)

def get_image(self, response):
    item = CustomItem()
    item['name'] = ...   # get image name
    item['image'] = ...  # get image data (e.g. from response.body)
    yield item
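To "return" values between requests (e.g. a name extracted on page A that you still need when get_image runs), pass them in the request's meta dict and read them back from response.meta in the callback. To make the chaining concrete, here is a toy, Scrapy-free simulation of how an engine drives yielded requests through their callbacks; the Request/Response classes, URLs, and the crawl loop below are illustrative stand-ins, not Scrapy's real implementation:

```python
# Toy simulation of request/callback chaining (not real Scrapy).
# A "request" is (url, callback, meta); the "engine" pops requests,
# fakes a response, and feeds whatever the callback yields back in.
from collections import deque

class Request:
    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}

class Response:
    def __init__(self, url, meta):
        self.url, self.meta = url, meta

def parse(response):
    for link in ["B1", "B2"]:              # links found on page A
        # carry data along in meta, like Scrapy's Request(meta=...)
        yield Request(link, visit_b_page, meta={"name": link.lower()})

def visit_b_page(response):
    # follow the link to the C page, forwarding the accumulated meta
    yield Request("C-from-" + response.url, visit_c_page, meta=response.meta)

def visit_c_page(response):
    # final callback: build the item from meta plus this page's content
    yield {"name": response.meta["name"], "image": "img@" + response.url}

def crawl(start):
    queue, items = deque([start]), []
    while queue:
        req = queue.popleft()
        resp = Response(req.url, req.meta)  # pretend we downloaded it
        for result in req.callback(resp):
            (queue.append if isinstance(result, Request) else items.append)(result)
    return items

items = crawl(Request("A", parse))
# items now holds one dict per crawled C page, with name carried from A
```

In real Scrapy the same pattern is just Request(url, callback=self.visit_b_page, meta={...}) and response.meta in the callback; the engine does the queueing for you.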

Also check the Scrapy documentation and these code snippets. They can help a lot :)

answered Oct 10 '22 by Bruno Penteado