
How to integrate scrapyjs function into a Scrapy project

I made a web scraper using the Scrapy framework to get concert ticket data from this website. I have been able to successfully scrape data from the elements inside each ticket listing on the page, except for the price, which can only be obtained by clicking the "tickets" button, going to the tickets page, and scraping the price from a ticket there.

After extensive Googling, I found that ScrapyJS (which is based on Splash) can be used within Scrapy to interact with JavaScript on the page (such as the button that needs to be clicked). I have seen some basic examples of how Splash is used to interact with JavaScript, but none of them show Splash integrated with Scrapy (not even in the docs).

I've been following the pattern of using item loaders to store the scraped elements in a parse method and then making a request that is supposed to go to another link and parse the HTML from that page by calling a second parse method, e.g.

yield scrapy.Request(next_link, callback=self.parse_price)
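To make that pattern concrete, here is a minimal sketch of the two-callback flow (plain scrapy.Spider, with placeholder URLs and XPaths rather than the real ones from my spider):

import scrapy

class TwoStepSpider(scrapy.Spider):
    name = 'two_step'
    start_urls = ['http://example.com/listings']  # placeholder

    def parse(self, response):
        # First callback: follow each link to a detail page.
        for next_link in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(next_link),
                                 callback=self.parse_price)

    def parse_price(self, response):
        # Second callback: scrape the price from the detail page.
        price = response.xpath('//*[@class="price"]/text()').extract()
        yield {'price': price[0] if price else None}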

The code for this would change somewhat now that I will be using ScrapyJS, though. To incorporate it, I was thinking of using a function similar to this one, from this site:

function main(splash)
  splash:go("http://example.com")
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end
But since a Lua script like that can't be written directly inside of a Python program, how/where would I even incorporate that kind of function in the program so that I can navigate to the next page by clicking the button and parse the HTML? I'm obviously very new at web scraping, so any help at all would be greatly appreciated. The code for the spider is below:

concert_ticket_spider.py

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from scrapy.http import Request
from concert_comparator.items import ComparatorItem

bandname = raw_input("Enter a bandname \n")
vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"

class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    #rules = (Rule(LinkExtractor(allow=('/' + bandname + '-.*', )), callback='parse_price'))
    # item = ComparatorItem()
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'
    item_fields = {
        'eventName': './/*[@class="productionsEvent"]/text()',
        'eventLocation': './/*[@class="productionsVenue"]/span[@itemprop="name"]/text()',
        'ticketsLink': './/a/@href',
        'eventDate': './/*[@class="productionsDate"]/text()',
        'eventCity': './/*[@class="productionsVenue"]/span[@itemprop="address"]/span[@itemprop="addressLocality"]/text()',
        'eventState': './/*[@class="productionsVenue"]/span[@itemprop="address"]/span[@itemprop="addressRegion"]/text()',
        'eventTime': './/*[@class="productionsTime"]/text()'
    }

    item_fields2 = {
        'ticketPrice': '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price',
    }
    def parse_price(self, response):
        l = XPathItemLoader(ComparatorItem(), response=response)
        l.add_xpath('ticketPrice', './/*[@class="price"]/text()')
        yield l.load_item()


    def parse(self, response):
        """
        Parse each event on the listings page and request the tickets
        page so the price can be scraped as well.
        """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield Request(vs_url, self.parse_price, meta={
                'splash': {
                    'args': {
                        # set rendering arguments here
                        'html': 1
                        # 'url' is prefilled from request url
                    },
                    # optional parameters
                    # This is the Splash/Lua script I want to run here, but I
                    # don't know where it is actually supposed to go:
                    # function main(splash)
                    #     splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
                    #     splash:go(vs_url)
                    #     splash:runjs("$('#some-button').click()")
                    #     return splash:html()
                    # end
                }
            })
            for field, xpath in self.item_fields2.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()
asked Jun 29 '15 by loremIpsum1771


1 Answer

The key point here is that scrapyjs provides a scrapyjs.SplashMiddleware downloader middleware that you need to configure in your project settings. Then every request that has a splash meta key is processed by the middleware and routed through Splash.
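A minimal sketch of that configuration, assuming a Splash instance is listening at localhost:8050 (adjust SPLASH_URL to wherever yours actually runs):

# settings.py
SPLASH_URL = 'http://localhost:8050'  # address of the running Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

With that in place, a Lua script like the one in your question is not pasted into the meta dict as code; it lives in a plain Python string and is sent to Splash's execute endpoint as the lua_source argument. Roughly (the button selector is taken from your question and is a placeholder):

script = """
function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go(splash.args.url)
    splash:runjs("$('#some-button').click()")
    return splash:html()
end
"""

def parse(self, response):
    # ... build the loader as before, then:
    yield Request(vs_url, self.parse_price, meta={
        'splash': {
            'endpoint': 'execute',
            'args': {'lua_source': script},
        }
    })

The response that parse_price receives then contains the HTML the script returned, so you can run your price XPath against it.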

FYI, I've personally successfully used Scrapy with scrapyjs before.

answered Oct 17 '22 by alecxe