I want to create a crawler that scrapes some data from TripAdvisor. Ideally, it will (a) identify the links to all locations to crawl, (b) collect links to all attractions in each location and (c) collect the destination names, dates and ratings for all reviews. I'd like to focus on part (a) for now.
Here is the website I'm starting off with: http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html
There is a problem here: the page initially shows only the top 10 destinations, and clicking "See more popular destinations" expands the list. This appears to be done with a JavaScript function. Unfortunately, I'm not familiar with JavaScript, but I think the following chunk may give clues about how it works:
<div class="morePopularCities" onclick="ta.call('ta.servlet.Tourism.showNextChildPage', event, this)">
<img id='lazyload_2067453571_25' height='27' width='27' src='http://e2.tacdn.com/img2/x.gif'/>
See more popular destinations in New Zealand </div>
I've found a few useful web scraping packages for R, such as rvest, RSelenium, XML and RCurl. Of these, only RSelenium appears to be able to handle this, but I still haven't been able to work it out.
Here is some relevant code:
tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
RSelenium::startServer()
remDr = RSelenium::remoteDriver(browserName = "internet explorer")
remDr$open()
remDr$navigate(tu)
# remDr$executeScript("JS_FUNCTION")
The last line should do the trick, but I'm not sure which function I need to call.
Once I manage to expand this list, I can obtain the links for each destination the same way I would for part (b), which I think I've already solved (for those interested):
library(rvest)
tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
tu = html_session(tu)
tu %>% html_nodes(xpath='//div[@class="popularCities"]/a') %>% html_attr("href")
[1] "/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html"
[2] "/Tourism-g255106-Auckland_North_Island-Vacations.html"
[3] "/Tourism-g255117-Blenheim_Marlborough_Region_South_Island-Vacations.html"
[4] "/Tourism-g255111-Rotorua_Rotorua_District_Bay_of_Plenty_Region_North_Island-Vacations.html"
[5] "/Tourism-g255678-Nelson_Nelson_Tasman_Region_South_Island-Vacations.html"
[6] "/Tourism-g255113-Taupo_Taupo_District_Waikato_Region_North_Island-Vacations.html"
[7] "/Tourism-g255109-Napier_Hawke_s_Bay_Region_North_Island-Vacations.html"
[8] "/Tourism-g612500-Wanaka_Otago_Region_South_Island-Vacations.html"
[9] "/Tourism-g255679-Russell_Bay_of_Islands_Northland_Region_North_Island-Vacations.html"
[10] "/Tourism-g255114-Tauranga_Bay_of_Plenty_Region_North_Island-Vacations.html"
As for step (c), I've found two links that might be helpful: https://github.com/hadley/rvest/blob/master/demo/tripadvisor.R and http://notesofdabbler.github.io/201408_hotelReview/scrapeTripAdvisor.html
If you have any tips on how to expand the list of top destinations, or how to go through the other steps in a smarter way, please let me know. I'd be really keen to hear from you.
Many thanks in advance!
Beyond travel planning, TripAdvisor holds a lot of useful data: flight prices, hotel prices, popular destinations, and even indicators of what is trending or has the potential to trend. Web scraping, the automatic extraction of data from web pages, can be used to collect this data.
R's package ecosystem makes web scraping straightforward, which is one reason the language is popular in the data science community.
In general, web scraping in R (or in any other language) boils down to three steps: (1) get the HTML for the web page that you want to scrape; (2) decide what part of the page you want to read and find out what HTML/CSS you need to select it; (3) select the HTML and analyze it in the way you need.
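The three steps can be sketched roughly as follows. This is only an illustration: an inline HTML fragment stands in for the live page so the example runs offline, the link text inside the fragment is made up, and the names link_xpath, base_url and urls are chosen here for the example. The XPath and the two hrefs are the ones shown in the question.

```r
library(rvest)

# Step 1: get the HTML. An inline sample fragment stands in for the live page
# (assumption for illustration); normally this would be read_html(<page URL>).
page <- read_html('<div class="popularCities">
  <a href="/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html">Queenstown</a>
  <a href="/Tourism-g255106-Auckland_North_Island-Vacations.html">Auckland</a>
</div>')

# Step 2: decide what to select: the links inside the popularCities container
link_xpath <- '//div[@class="popularCities"]/a'

# Step 3: select the nodes and analyze them (here: turn relative hrefs into full URLs)
hrefs <- page %>% html_nodes(xpath = link_xpath) %>% html_attr("href")
base_url <- "http://www.tripadvisor.co.nz"  # host taken from the question's start URL
urls <- paste0(base_url, hrefs)
print(urls)
```

Building full URLs with paste0 assumes every href is site-relative and starts with "/", as in the output listed in the question.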
Basically, you can try sending a click event to the <div class="morePopularCities"> element. Something like this:
remDr$navigate(tu)
# locate the "See more popular destinations" div by its class and click it
div <- remDr$findElement(using = "class name", value = "morePopularCities")
div$clickElement()
To expand all locations, you can repeat the above logic in a while loop: keep clicking on the <div> until no more items are available (that is, until the div is no longer in the page):
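# findElements returns an empty list when nothing matches, which ends the loop
divs <- remDr$findElements(using = "class name", value = "morePopularCities")
while (length(divs) > 0) {
  for (div in divs) {
    div$clickElement()
  }
  # look again: the div is expected to disappear once the list is fully expanded
  divs <- remDr$findElements(using = "class name", value = "morePopularCities")
}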
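One caveat with the loop above: if the div stays in the page after the last batch loads (for example, hidden rather than removed), it would never terminate, and it does not wait for new content to arrive. Here is a more defensive sketch of the same idea. The function name expand_all and the parameters max_rounds and pause are hypothetical, introduced only for this example; the function relies solely on the findElements and clickElement methods already used above.

```r
# Click every matching "see more" element until none remain or a cap is hit.
# `driver` is any object exposing findElements(using, value), such as an
# RSelenium remoteDriver. max_rounds and pause are hypothetical safety knobs.
expand_all <- function(driver, class_name = "morePopularCities",
                       max_rounds = 50, pause = 1) {
  rounds <- 0
  repeat {
    divs <- driver$findElements(using = "class name", value = class_name)
    # stop when nothing is left to click or the safety cap is reached
    if (length(divs) == 0 || rounds >= max_rounds) break
    for (div in divs) {
      # a click can fail if the element has vanished or is hidden; skip it
      try(div$clickElement(), silent = TRUE)
    }
    rounds <- rounds + 1
    Sys.sleep(pause)  # give the page time to load the next batch
  }
  rounds  # number of click rounds performed
}
```

With the session from the question this would be called as expand_all(remDr). Afterwards, the expanded page can be handed to rvest via remDr$getPageSource()[[1]] and parsed with the same XPath used for the top 10 links.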
I'm not fluent in R, so you may find my code example not pretty. Feel free to suggest improvements.