I am attempting to crawl a site for news articles. My start_url contains:
(1) links to each article: http://example.com/symbol/TSLA
and
(2) a "More" button that makes an AJAX call that dynamically loads more articles within the same start_url: http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0&slugs=tsla&is_symbol_page=true
A parameter to the AJAX call is "page", which is incremented each time the "More" button is clicked. For example, clicking "More" once will load an additional n articles and update the page parameter in the "More" button onClick event, so that next time "More" is clicked, "page" two of articles will be loaded (assuming "page" 0 was loaded initially, and "page" 1 was loaded on the first click).
For each "page" I would like to scrape the contents of each article using Rules, but I do not know how many "pages" there are and I do not want to choose some arbitrary m (e.g., 10k). I can't seem to figure out how to set this up.
Following this question, Scrapy Crawl URLs in Order, I have tried to build a list of candidate URLs, but I can't determine how and where, in a CrawlSpider, to send the next URL from the pool after parsing the previous one and confirming that it contains news links. My Rules send responses to a parse_item callback, where the article contents are parsed.
Is there a way to observe the contents of the links page (similar to the BaseSpider example) before the Rules are applied and parse_item is called, so that I know when to stop crawling?
Simplified code (I removed several of the fields I'm parsing for clarity):
from scrapy import log
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import Selector

# NewsItem is the project's Item class (its definition is omitted here)


class ExampleSite(CrawlSpider):
    name = "so"
    download_delay = 2
    more_pages = True
    current_page = 0

    allowed_domains = ['example.com']

    start_urls = ['http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0' +
                  '&slugs=tsla&is_symbol_page=true']
    ## could also use
    ## start_urls = ['http://example.com/symbol/tsla']

    # pre-built pool of AJAX page URLs (the arbitrary upper bound is exactly the problem)
    ajax_urls = []
    for i in range(1, 1000):
        ajax_urls.append('http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=' + str(i) +
                         '&slugs=tsla&is_symbol_page=true')

    rules = (
        Rule(SgmlLinkExtractor(allow=('/symbol/tsla', ))),
        Rule(SgmlLinkExtractor(allow=('/news-article.*tesla.*', '/article.*tesla.*', )),
             callback='parse_item'),
    )

    ## need something like this??
    ## override parse?
    ##     if response.body == 'no results':
    ##         self.more_pages = False
    ##         ## stop crawler??
    ##     else:
    ##         self.current_page = self.current_page + 1
    ##         yield Request(self.ajax_urls[self.current_page], callback=self.parse_start_url)

    def parse_item(self, response):
        self.log("Scraping: %s" % response.url, level=log.INFO)
        hxs = Selector(response)
        item = NewsItem()
        item['url'] = response.url
        item['source'] = 'example'
        # .extract() so the item stores strings rather than Selector objects
        item['title'] = hxs.xpath('//title/text()').extract()
        item['date'] = hxs.xpath('//div[@class="article_info_pos"]/span/text()').extract()
        yield item
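To make the commented-out idea above concrete, this is roughly the kind of override I have in mind, as a sketch only: 'no results' is a stand-in for whatever the empty AJAX response actually says. CrawlSpider calls parse_start_url on each start-URL response before extracting links with the Rules, and a request yielded without a callback goes back through CrawlSpider.parse, so the Rules would still fire on the later pages:

    # Sketch of the commented-out idea -- 'no results' is a placeholder marker.
    def parse_start_url(self, response):
        if 'no results' in response.body:
            self.more_pages = False               # nothing left to load
            return
        # This "page" had articles, so queue the next AJAX page.  Leaving the
        # callback unset routes the response back through CrawlSpider.parse,
        # so the Rules (and this method) also run on it.
        if self.current_page < len(self.ajax_urls):
            yield Request(self.ajax_urls[self.current_page])   # ajax_urls[0] is page=1
            self.current_page += 1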
AJAX is just an asynchronous request, and it can easily be replicated with Scrapy or anything else for that matter. It is true, however, that you can use something like Selenium to render the page with all of its AJAX requests and bells and whistles if you are looking for a lazy, do-it-all approach.
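For example, a quick way to convince yourself of this is to open the AJAX endpoint from the question directly in scrapy shell and confirm that it comes back as an ordinary HTML fragment, or as the "no Focus articles" message once the pages run out (the XPath below is an assumption about the markup):

# From the command line (the URL is the AJAX endpoint from the question):
#   scrapy shell "http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=1&slugs=tsla&is_symbol_page=true"
# Inside the shell, `response` is already populated and the normal selectors work on it:
from scrapy.selector import Selector

sel = Selector(response)
print(sel.xpath("//div[@class='symbol_article']/a/@href").extract())    # article links, if any
print("There are no Focus articles on your stocks." in response.body)   # True once the pages run out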
You cannot click a button with Scrapy. You can only send requests and receive responses; it is up to you to interpret the response, using a separate JavaScript engine if the page really requires one.
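For instance, if the page number is embedded in the "More" button's onClick handler, you can read it out of the HTML and issue the equivalent AJAX request yourself. This is a hypothetical sketch: the button markup and its class name are assumptions; only the URL template comes from the question.

import re

from scrapy.http import Request
from scrapy.selector import Selector

# Hypothetical markup assumption:  <a class="more" onclick="loadHeadlines(2)">More</a>
AJAX_TEMPLATE = ('http://example.com/account/ajax_headlines_content'
                 '?type=in_focus_articles&page={page}&slugs=tsla&is_symbol_page=true')

def next_page_request(response):
    onclick = Selector(response).xpath("//a[@class='more']/@onclick").extract()
    if not onclick:
        return None                            # no "More" button in this response
    match = re.search(r'\d+', onclick[0])      # the page number baked into onClick
    if match is None:
        return None
    return Request(AJAX_TEMPLATE.format(page=match.group(0)))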
CrawlSpider may be too limited for your purposes here. If you need a lot of custom logic, you are usually better off inheriting from Spider.
Scrapy provides a CloseSpider exception that can be raised when you need to stop parsing under certain conditions. The page you are crawling returns the message "There are no Focus articles on your stocks" once you exceed the maximum page, so you can check for this message and stop iterating when it appears.
In your case you can go with something like this:
from urlparse import urljoin

from scrapy import log
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.exceptions import CloseSpider

# NewsItem is the same Item class used in the question (a minimal definition is sketched below)


class ExampleSite(Spider):
    name = "so"
    download_delay = 0.1
    more_pages = True
    next_page = 1

    start_urls = ['http://example.com/account/ajax_headlines_content?type=in_focus_articles&page=0' +
                  '&slugs=tsla&is_symbol_page=true']

    allowed_domains = ['example.com']

    def create_ajax_request(self, page_number):
        """
        Helper function to create ajax request for next page.
        """
        ajax_template = 'http://example.com/account/ajax_headlines_content?type=in_focus_articles&page={pagenum}&slugs=tsla&is_symbol_page=true'

        url = ajax_template.format(pagenum=page_number)
        return Request(url, callback=self.parse)

    def parse(self, response):
        """
        Parsing of each page.
        """
        if "There are no Focus articles on your stocks." in response.body:
            self.log("About to close spider", level=log.WARNING)
            raise CloseSpider(reason="no more pages to parse")

        # there is some content -- extract links to articles
        sel = Selector(response)
        links_xpath = "//div[@class='symbol_article']/a/@href"
        links = sel.xpath(links_xpath).extract()
        for link in links:
            url = urljoin(response.url, link)
            # follow link to article
            # commented out to see how pagination works
            # yield Request(url, callback=self.parse_item)

        # generate request for next page (yield first, then advance the counter,
        # so that page 1 is not skipped)
        yield self.create_ajax_request(self.next_page)
        self.next_page += 1

    def parse_item(self, response):
        """
        Parsing of each article page.
        """
        self.log("Scraping: %s" % response.url, level=log.INFO)
        hxs = Selector(response)
        item = NewsItem()
        item['url'] = response.url
        item['source'] = 'example'
        # .extract() so the item stores strings rather than Selector objects
        item['title'] = hxs.xpath('//title/text()').extract()
        item['date'] = hxs.xpath('//div[@class="article_info_pos"]/span/text()').extract()
        yield item
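For completeness, NewsItem is not defined in either snippet above; a minimal sketch of the Item class, using only the fields that appear in the code (it would normally live in the project's items.py), is:

from scrapy.item import Item, Field

class NewsItem(Item):
    url = Field()
    source = Field()
    title = Field()
    date = Field()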