I made a web scraper using the Scrapy framework to get concert ticket data from this website. I have been able to successfully scrape data from the elements inside each ticket listing on the page, except for the price, which can only be reached by clicking the "tickets" button, going to the tickets page, and scraping the price from a ticket there.
After extensive Googling, I found that ScrapyJS (which is based on Splash) can be used within Scrapy to interact with JavaScript on the page (such as the button that needs to be clicked). I have seen some basic examples of how Splash is used to interact with JavaScript, but none of them showed Splash's integration with Scrapy (not even in the docs).
I've been following the pattern of using item loaders to store the scraped elements in a parse method and then making a request that goes to another link and parses the HTML from that page with a second parse method, e.g.:

yield scrapy.Request(next_link, callback=self.parse_price)

but that code would change somewhat now that I will be using ScrapyJS. To incorporate it, I was thinking of using functions similar to this one:
function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end
from this site. But since Lua can't be written directly inside a Python program, how/where would I even incorporate that kind of function so that I can navigate to the next page by clicking the button and parse the HTML? I'm obviously very new at web scraping, so any help at all would be greatly appreciated. The code for the spider is below:
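For orientation, here is a minimal sketch of the usual shape: the Lua function lives inside the Python file as a plain string and is handed to Splash through the request's splash meta key. The 'execute' endpoint and 'lua_source' argument names are what scrapyjs/Splash expect; everything else here (the script body, the helper name) is illustrative.

```python
# The Lua script is just a Python triple-quoted string; Splash compiles
# and runs it server-side, so no Lua is ever "written in Python".
LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(0.5)
    return splash:html()
end
"""

def splash_meta(lua_source):
    # This dict goes into scrapy.Request(url, callback, meta=splash_meta(...));
    # scrapyjs's SplashMiddleware picks up everything under the 'splash' key
    # and sends it to Splash's /execute endpoint.
    return {
        'splash': {
            'endpoint': 'execute',
            'args': {'lua_source': lua_source},
        },
    }
```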
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from concert_comparator.items import ComparatorItem
bandname = raw_input("Enter a bandname \n")
vs_url = "http://www.vividseats.com/concerts/" + bandname + "-tickets.html"
class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    #rules = (Rule(LinkExtractor(allow=('/' + bandname + '-.*', )), callback='parse_price'))
    # item = ComparatorItem()
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'
    item_fields = {
        'eventName': './/*[@class="productionsEvent"]/text()',
        'eventLocation': './/*[@class="productionsVenue"]/span[@itemprop="name"]/text()',
        'ticketsLink': './/a/@href',
        'eventDate': './/*[@class="productionsDate"]/text()',
        'eventCity': './/*[@class="productionsVenue"]/span[@itemprop="address"]/span[@itemprop="addressLocality"]/text()',
        'eventState': './/*[@class="productionsVenue"]/span[@itemprop="address"]/span[@itemprop="addressRegion"]/text()',
        'eventTime': './/*[@class="productionsTime"]/text()'
    }
    item_fields2 = {
        'ticketPrice': '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price',
    }
    def parse_price(self, response):
        loader = XPathItemLoader(ComparatorItem(), response=response)
        loader.add_xpath('ticketPrice', './/*[@class="price"]/text()')
        yield loader.load_item()
    def parse(self, response):
        """
        Default callback: extract ticket fields from each event listing.
        """
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            lua_script = """
            function main(splash)
                splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
                splash:go(splash.args.url)
                splash:runjs("$('#some-button').click()")
                return splash:html()
            end
            """
            yield Request(vs_url, self.parse_price, meta={
                'splash': {
                    'endpoint': 'execute',
                    'args': {
                        # set rendering arguments here; the Lua script is
                        # passed as a plain string under 'lua_source'
                        'lua_source': lua_script,
                        'html': 1,
                        # 'url' is prefilled from request url
                    },
                },
            })
            for field, xpath in self.item_fields2.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()
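As an aside, a common way to combine fields scraped from two pages is to carry the half-built item to the second callback through request meta. The sketch below strips the idea down to plain dicts (the function names and the dict-based "item" are illustrative, not Scrapy's loader API):

```python
def build_request_meta(partial_item):
    # In the spider this would be:
    #   Request(link, callback=self.parse_price, meta={'item': partial_item})
    return {'item': partial_item}

def finish_item(meta, price):
    # In parse_price: pull the partial item back out of response.meta
    # and add the field scraped from the second page.
    item = dict(meta['item'])
    item['ticketPrice'] = price
    return item
```

With this pattern the listing-page callback yields only Requests, and parse_price yields the completed item.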
The key point here is that scrapyjs provides a scrapyjs.SplashMiddleware downloader middleware that you need to configure. Then, every request that has a splash meta key is processed by the middleware.
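Concretely, the configuration lives in the project's settings.py. A sketch along the lines of the scrapyjs README (the middleware order 725 follows its documented example; adjust SPLASH_URL to wherever your Splash instance actually runs):

```python
# settings.py (fragment) -- enable scrapyjs so the 'splash' meta key works
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
# address of the running Splash server (a local Docker container here)
SPLASH_URL = 'http://localhost:8050'
# make the dupe filter aware of Splash arguments
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
```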
FYI, I've personally used Scrapy with scrapyjs successfully before.