
scrapy - parsing items that are paginated

Tags: python, scrapy

I have a URL of the form:

example.com/foo/bar/page_1.html 

There are a total of 53 pages, each one of them has ~20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and also follows each item one page deeper to get more info about it:

# imports this code relies on (Scrapy 0.1x-era API)
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
# DegustaItem is defined in the project's items module

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

        # get profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with base url since profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)

        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request  # note: this returns from inside the loop

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
example.com/foo/bar/page_53.html
AlexBrand asked Oct 11 '12



2 Answers

You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
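For instance, here is a minimal sketch of the question's loop rewritten with yield (reusing the names, imports, and XPaths from the code above; only the relevant lines are kept):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        yield request  # yield keeps the loop going; return stopped it after the first row

You could also yield a Request for the next page from the same callback in exactly the same way.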

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    # xrange(1, 54) covers pages 1 through 53
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page
                  for page in xrange(1, 54)]

(With ~20 rows per page, parse will still need yield rather than return so that every row's request gets emitted.)
Achim answered Sep 18 '22


You could use CrawlSpider instead of BaseSpider and use SgmlLinkExtractor to extract the pagination links.

For instance:

start_urls = ["www.example.com/page1"] rules = ( Rule (SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',))                 , follow= True),           Rule (SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',))                 , callback='parse_call')     ) 

The first rule tells Scrapy to follow the links matched by its XPath expression; the second tells Scrapy to call parse_call on the links matched by its XPath expression, in case you want to parse something on each page.
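The answer doesn't show parse_call itself; a minimal sketch, assuming a DegustaItem as in the question and a hypothetical XPath for the detail page, could be:

def parse_call(self, response):
    hxs = HtmlXPathSelector(response)
    item = DegustaItem()
    # hypothetical XPath -- adapt it to the real detail-page markup
    item['name'] = hxs.select('//h1/text()').extract()[0]
    return item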

For more info, see the docs: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

bslima answered Sep 19 '22