
scrapy - parsing items that are paginated

Tags: python, scrapy

I have a URL of the form:

example.com/foo/bar/page_1.html 

There are a total of 53 pages, each one of them has ~20 rows.

I basically want to get all the rows from all the pages, i.e. ~53*20 items.

I have working code in my parse method that parses a single page and also follows each item one page deeper to get more info about it:

# imports this code relies on (Scrapy 0.1x-era API)
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.response import get_base_url
from scrapy.utils.url import urljoin_rfc
# DegustaItem is defined in the project's items module

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        # some items don't have a category associated with them
        try:
            item['category'] = rest.select('td[3]/a/text()').extract()[0]
        except IndexError:
            item['category'] = ''
        item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

        # get profile url
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        # join with base url since profile url is relative
        base_url = get_base_url(response)
        follow = urljoin_rfc(base_url, rel_url)

        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        return request  # note: this returns from inside the loop

def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each page?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
example.com/foo/bar/page_53.html
AlexBrand asked Oct 11 '12



2 Answers

You have two options to solve your problem. The general one is to use yield to generate new requests instead of return. That way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
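For instance, here is a minimal sketch of the question's loop rewritten with yield (reusing the names, imports, and XPaths from the code above; only the relevant lines are kept):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')
    for rest in restaurants:
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        yield request  # yield keeps the loop going; return stopped it after the first row

You could also yield a Request for the next page from the same callback in exactly the same way.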

In your case there is probably a simpler solution: just generate the list of start URLs from a pattern like this:

class MySpider(BaseSpider):
    # xrange(1, 54) covers pages 1 through 53
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page
                  for page in xrange(1, 54)]

(With ~20 rows per page, parse will still need yield rather than return so that every row's request gets emitted.)
Achim answered Sep 18 '22


You could use CrawlSpider instead of BaseSpider and use SgmlLinkExtractor to extract the pagination links.

For instance:

start_urls = ["www.example.com/page1"] rules = ( Rule (SgmlLinkExtractor(restrict_xpaths=('//a[@class="next_page"]',))                 , follow= True),           Rule (SgmlLinkExtractor(restrict_xpaths=('//div[@class="foto_imovel"]',))                 , callback='parse_call')     ) 

The first rule tells Scrapy to follow the links matched by its XPath expression; the second tells Scrapy to call parse_call on the links matched by its XPath expression, in case you want to parse something on each page.
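The answer doesn't show parse_call itself; a minimal sketch, assuming a DegustaItem as in the question and a hypothetical XPath for the detail page, could be:

def parse_call(self, response):
    hxs = HtmlXPathSelector(response)
    item = DegustaItem()
    # hypothetical XPath -- adapt it to the real detail-page markup
    item['name'] = hxs.select('//h1/text()').extract()[0]
    return item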

For more info, see the docs: http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

bslima answered Sep 19 '22