
Scraping data from TripAdvisor using R

Tags: r, xpath, rselenium

I want to create a crawler that will scrape some data from TripAdvisor. Ideally, it will (a) identify the links to all locations to crawl, (b) collect links to all attractions in each location and (c) collect the destination names, dates and ratings for all reviews. I'd like to focus on part (a) for now.

Here is the website I'm starting off with: http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html

There is a problem here: the page initially shows only the top 10 destinations, and clicking "See more popular destinations" expands the list. It appears to use a JavaScript function to achieve this. Unfortunately, I'm not familiar with JavaScript, but I think the following chunk may give clues about how it works:

<div class="morePopularCities" onclick="ta.call('ta.servlet.Tourism.showNextChildPage', event, this)">
<img id='lazyload_2067453571_25' height='27' width='27' src='http://e2.tacdn.com/img2/x.gif'/>
See more popular destinations in New Zealand </div>

I've found a few useful web-scraping packages for R, such as rvest, RSelenium, XML and RCurl, but of these only RSelenium appears able to handle this. That said, I still haven't been able to work it out.

Here is some relevant code:

tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
RSelenium::startServer()                                         # start a local Selenium server
remDr = RSelenium::remoteDriver(browserName = "internet explorer")
remDr$open()                                                     # open the browser session
remDr$navigate(tu)                                               # load the TripAdvisor page
# remDr$executeScript("JS_FUNCTION")

The last line should do the trick, but I'm not sure what function I need to call.
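
For reference, one possibility is to skip reproducing TripAdvisor's own handler and instead trigger a click on the element from JavaScript. This is only a rough, untested sketch that continues from the RSelenium session above; how web elements are passed as script arguments can differ between RSelenium versions:

# Locate the "See more popular destinations" <div> and click it via injected JS.
# "class name" is the WebDriver locator strategy; the class itself comes from the HTML snippet above.
more <- remDr$findElement(using = "class name", value = "morePopularCities")
remDr$executeScript("arguments[0].click();", args = list(more))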

Once I manage to expand this list, I should be able to obtain the links for each destination the same way I solve part (b), which I think I've already worked out (for those interested):

library(rvest)
tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
tu = html_session(tu)
tu %>% html_nodes(xpath='//div[@class="popularCities"]/a') %>% html_attr("href")
 [1] "/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html"                      
 [2] "/Tourism-g255106-Auckland_North_Island-Vacations.html"                                     
 [3] "/Tourism-g255117-Blenheim_Marlborough_Region_South_Island-Vacations.html"                  
 [4] "/Tourism-g255111-Rotorua_Rotorua_District_Bay_of_Plenty_Region_North_Island-Vacations.html"
 [5] "/Tourism-g255678-Nelson_Nelson_Tasman_Region_South_Island-Vacations.html"                  
 [6] "/Tourism-g255113-Taupo_Taupo_District_Waikato_Region_North_Island-Vacations.html"          
 [7] "/Tourism-g255109-Napier_Hawke_s_Bay_Region_North_Island-Vacations.html"                    
 [8] "/Tourism-g612500-Wanaka_Otago_Region_South_Island-Vacations.html"                          
 [9] "/Tourism-g255679-Russell_Bay_of_Islands_Northland_Region_North_Island-Vacations.html"      
[10] "/Tourism-g255114-Tauranga_Bay_of_Plenty_Region_North_Island-Vacations.html"  
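
As a small follow-up (my own sketch, not part of the original code above), the relative hrefs can be paired with the link text and turned into absolute URLs, assuming the anchor text holds the destination name:

library(rvest)
nodes <- tu %>% html_nodes(xpath = '//div[@class="popularCities"]/a')
dest  <- data.frame(
  name = nodes %>% html_text(trim = TRUE),                                   # destination labels
  url  = paste0("http://www.tripadvisor.co.nz", nodes %>% html_attr("href")), # absolute URLs
  stringsAsFactors = FALSE
)
head(dest)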

As for step (c), I've found some useful links that might help:
https://github.com/hadley/rvest/blob/master/demo/tripadvisor.R
http://notesofdabbler.github.io/201408_hotelReview/scrapeTripAdvisor.html
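
To give a flavour of step (c), here is a rough sketch in the spirit of those linked demos. The URL is hypothetical, and the CSS selectors (.reviewSelector, .ratingDate, .rating) are assumptions based on TripAdvisor's markup at the time of the demos, so they will likely need adjusting:

library(rvest)

url  <- "http://www.tripadvisor.co.nz/Attraction_Review-g255122-d123456-Reviews.html"  # hypothetical attraction page
page <- read_html(url)

reviews <- page %>% html_nodes(".reviewSelector")                         # one node per review (assumed selector)
dates   <- reviews %>% html_nodes(".ratingDate") %>% html_text(trim = TRUE)
ratings <- reviews %>% html_nodes(".rating img") %>% html_attr("alt")     # e.g. "4 of 5 stars" (assumed)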

If you have any tips on how to expand the list of top destinations, or how to go through the other steps in a smarter way, please let me know; I'd be really keen to hear from you.

Many thanks in advance!

IVR asked Apr 18 '15

1 Answer

Basically, you can try to send a click event to the <div class="morePopularCities">. Something like this:

remDr$navigate(tu)
div <- remDr$findElement(using = "class name", value = "morePopularCities")
div$clickElement()

To expand all locations, you can repeat the above logic in a while loop: keep clicking on the <div> until no more items are available (i.e. until the div is no longer in the page):

divs <- remDr$findElements(using = "class name", value = "morePopularCities")
while(length(divs) > 0) {
  for(div in divs) {
    div$clickElement()
  }
  divs <- remDr$findElements(using = "class name", value = "morePopularCities")
}
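
One small refinement worth considering (my own addition, not part of the original answer): give the page a moment to render the extra destinations between clicks, otherwise the loop may re-find the same <div> before the new content has loaded. A sketch, assuming a fixed pause is acceptable:

divs <- remDr$findElements(using = "class name", value = "morePopularCities")
while(length(divs) > 0) {
  divs[[1]]$clickElement()   # click the first "See more" element
  Sys.sleep(2)               # crude wait for the new destinations to render
  divs <- remDr$findElements(using = "class name", value = "morePopularCities")
}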

I'm not fluent in R, so my code example may not be pretty; feel free to suggest improvements.

har07 answered Nov 11 '22