I have a url of the form:
example.com/foo/bar/page_1.html
There are a total of 53 pages, each one of them has ~20 rows.
I basically want to get all the rows from all the pages, i.e. ~53*20 items.
I have working code in my parse method that parses a single page and also goes one page deeper per item to get more info about the item:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

        # get profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with base url since profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)

        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item
The question is, how do I crawl each page?
example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
example.com/foo/bar/page_53.html
First, we need to import the requests HTTP library and BeautifulSoup. Then we create a variable called isHaveNextPage and a variable called page; both are useful for tackling the pagination later.
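A minimal sketch of that approach, assuming the URL pattern and the 53-page total from the question; the CSS row selector is only an adaptation of the question's XPath, and the loop variables are the ones named above:

import requests
from bs4 import BeautifulSoup

isHaveNextPage = True
page = 1
rows = []

while isHaveNextPage:
    # URL pattern taken from the question
    url = 'http://example.com/foo/bar/page_%s.html' % page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # collect the ~20 rows on this page, skipping the header row
    # (CSS selector adapted from the question's XPath; adjust as needed)
    rows.extend(soup.select('#contenido-resbus table tr')[1:])

    page += 1
    if page > 53:  # 53 pages in total, per the question
        isHaveNextPage = False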
Yes, you can. As mentioned above, BeautifulSoup can be used for parsing HTML responses in Scrapy callbacks. You just have to feed the response's body into a BeautifulSoup object and extract whatever data you need from it. BeautifulSoup supports several HTML/XML parsers.
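For example, a sketch of a Scrapy callback that uses BeautifulSoup instead of HtmlXPathSelector (the spider name and the DegustaItem import path are assumptions, and the CSS selector is adapted from the question's XPath):

from bs4 import BeautifulSoup
from scrapy.spider import BaseSpider

from degusta.items import DegustaItem  # hypothetical project import; adjust to your project


class DegustaBsSpider(BaseSpider):
    name = 'degusta_bs'
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1, 54)]

    def parse(self, response):
        # hand the raw HTML of the response to BeautifulSoup
        soup = BeautifulSoup(response.body, 'html.parser')
        for row in soup.select('#contenido-resbus table tr')[1:]:
            item = DegustaItem()
            # second cell holds the name, as in the question's td[2]/a/b/text()
            item['name'] = row.select('td')[1].get_text(strip=True)
            yield item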
You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
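Applied to the parse method from the question, the only changes are yielding the request from inside the loop instead of returning it (so every restaurant row gets its own profile request) and referencing the callback as self.parse_profile; the surrounding class and imports stay exactly as in the question:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)

        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        yield request  # yield, not return: one profile request per row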
In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:
class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1, 54)]
You could use the CrawlSpider instead of the BaseSpider and use SgmlLinkExtractor to extract the pages in the pagination.
For instance:
start_urls = ["www.example.com/page1"] rules = ( Rule (SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)) , follow= True), Rule (SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',)) , callback='parse_call') )
The first rule tells Scrapy to follow the links matched by its XPath expression; the second rule tells Scrapy to call parse_call on the links matched by its XPath expression, in case you want to parse something on each page.
For more info please see the doc: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
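Put together for the question's pagination, a CrawlSpider could look roughly like this; the next-page XPath is a placeholder that depends on the site's markup, the import path for DegustaItem is an assumption, and parse_item reuses the row extraction from the question (trimmed to the name field):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from degusta.items import DegustaItem  # hypothetical project import; adjust to your project


class DegustaCrawlSpider(CrawlSpider):
    name = 'degusta_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/foo/bar/page_1.html']

    rules = (
        # follow every pagination link and hand each listing page to parse_item
        # ('//a[@class="next_page"]' is a guess; adjust it to the real markup)
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',)),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # the first page is not reached through a Rule, so parse it here as well
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        for rest in hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]'):
            item = DegustaItem()
            item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
            yield item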