Scrapy crawl with next page

I have this code for the Scrapy framework:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html

class Scrapy1Spider(scrapy.Spider):
    name = "scrapy1"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo',
    )

    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse", follow= True),)

    def parse(self, response):
        site = html.fromstring(response.body_as_unicode())
        titles = site.xpath('//div[@class="content"]/p[@class="row"]')
        print len(titles), 'AAAA'

The problem is that I only get 100 results; the spider doesn't go on to the next pages.

What is wrong here?

asked Sep 17 '15 06:09

Mirza Delic

1 Answer

Your rule is not used because you don't use a CrawlSpider. Only CrawlSpider processes the rules attribute; a plain scrapy.Spider ignores it.

So you have to create the next page requests manually, like so:

# -*- coding: utf-8 -*-
import scrapy
from lxml import html

class Scrapy1Spider(scrapy.Spider):
    name = "craiglist"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo',
    )

    # no rules here: a plain scrapy.Spider would ignore them anyway

    def parse(self, response):
        site = html.fromstring(response.text)
        titles = site.xpath('//div[@class="content"]/p[@class="row"]')
        print(len(titles), 'AAAA')

        # follow the next page link, if there is one
        next_page = response.xpath('.//a[@class="button next"]/@href').extract()
        if next_page:
            # urljoin resolves a relative href against the current page URL
            next_page_url = response.urljoin(next_page[0])
            # no callback is given, so Scrapy calls parse() for the next page too
            yield scrapy.Request(url=next_page_url)
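
The href on the next button is usually relative, so it has to be turned into an absolute URL before it can be requested. Here is a minimal sketch of that step using only the standard library's urljoin (Scrapy's response.urljoin performs this kind of resolution for you). The sample hrefs and the next_page_url helper are made up for illustration; the real href comes from the page.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration only.
page_url = 'http://sfbay.craigslist.org/search/npo'
relative_href = '/search/npo?s=100'
query_only_href = '?s=200'
absolute_href = 'http://sfbay.craigslist.org/search/npo?s=300'


def next_page_url(current_url, href):
    """Resolve a next-page href against the URL of the page it was found on."""
    return urljoin(current_url, href)


for href in (relative_href, query_only_href, absolute_href):
    print(next_page_url(page_url, href))
```

Unlike plain string concatenation with the host name, this also copes with hrefs that are already absolute or that contain only a query string.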

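Or use a CrawlSpider, like so. Note that the callback is named parse_page rather than parse, because CrawlSpider uses parse internally to apply its rules: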

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html

class Scrapy1Spider(CrawlSpider):
    name = "craiglist"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo',
    )

    # follow every "next" button and hand each fetched page to parse_page
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse_page", follow=True),)

    def parse_start_url(self, response):
        # start_urls responses are not passed to rule callbacks, so handle the first page here
        return self.parse_page(response)

    def parse_page(self, response):
        site = html.fromstring(response.text)
        titles = site.xpath('//div[@class="content"]/p[@class="row"]')
        print(len(titles), 'AAAA')
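
To check the follow-the-next-link logic without hitting the network, here is a small stand-alone sketch that uses only the standard library. It is not Scrapy code: the PAGES dict and its HTML are invented stand-ins for the listing pages, and NextLinkParser and crawl are hypothetical helpers that mimic what both spiders above do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up pages standing in for the real listing; only the next link matters.
PAGES = {
    'http://example.com/search/npo':
        '<p class="row">a</p><a class="button next" href="/search/npo?s=100">next</a>',
    'http://example.com/search/npo?s=100':
        '<p class="row">b</p><a class="button next" href="/search/npo?s=200">next</a>',
    'http://example.com/search/npo?s=200':
        '<p class="row">c</p>',  # last page: no next link
}


class NextLinkParser(HTMLParser):
    """Hypothetical helper: remembers the href of the first <a class="button next">."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('class') == 'button next' and self.next_href is None:
            self.next_href = attrs.get('href')


def crawl(start_url):
    """Visit start_url, then keep following the next link until there is none."""
    visited = []
    url = start_url
    # the membership check stops the loop if a page links back to one already seen
    while url is not None and url not in visited:
        visited.append(url)
        parser = NextLinkParser()
        parser.feed(PAGES[url])
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return visited


print(crawl('http://example.com/search/npo'))
```

The loop ends on the page that has no next button, which is the same stopping condition as the `if next_page:` check in the manual spider.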

answered Oct 18 '22 03:10

Frank Martin

Tags:

python

lxml

scrapy

scrapy-spider
