 

How to web-scrape on-click information with R?

I am trying to scrape the phone number from this website: http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53. The phone number can be scraped with the rvest package using the selector .\'id_raw\'\::nth-child(1) span+ div strong (suggested by SelectorGadget, http://selectorgadget.com/).

The problem is that the information is only revealed after its mask is clicked. So somehow I have to open a session, perform a click, and then scrape the information.

EDIT: By the way, it's not a link, imho. Have a look at the source. I have a problem because I'm a regular R user, not a JavaScript programmer.
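For what it's worth, a plain rvest read of the static HTML cannot work here. A minimal sketch against a stand-in for the masked markup (the snippet below is hypothetical, modelled on the page's contact box; the real page is more complex):

```r
library(rvest)

# hypothetical stand-in for the masked contact box in the static HTML
masked <- '<div class="contactbox-indent rel brkword">
             <span>Telefon</span>
             <div><strong class="xx-large">Poka&#380; numer</strong></div>
           </div>'

# the static page only carries the "Pokaż numer" ("show number")
# placeholder; the digits are fetched by JavaScript after the click,
# so a selector run on the static HTML never sees them
html_text(html_node(read_html(masked), "span + div strong"))
# [1] "Pokaż numer"
```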


asked Feb 13 '16 by Marcin Kosiński


2 Answers

You can grab the data embedded in the <li> tags that tell the onclick handler what to do, and just get the data directly:

library(httr)
library(rvest)
library(purrr)
library(stringr)

URL <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

pg <- read_html(URL)

html_nodes(pg, "li.rel") %>%       # get the 'special' <li> tags
  html_attrs() %>%                 # extract all the attrs (they're non-standard)
  flatten_chr() %>%                # list to character vector
  keep(~grepl("rel \\{", .x)) %>%  # only want ones with 'hidden' secret data
  str_extract("(\\{.*\\})") %>%    # only get the data
  unique() %>%                     # there are duplicates
  map_df(function(x) {

    path <- str_match(x, "'path':'([[:alnum:]]+)'")[,2]                  # extract out the path
    id <- str_match(x, "'id':'([[:alnum:]]+)'")[,2]                      # extract out the id

    ajax <- sprintf("http://olx.pl/ajax/misc/contact/%s/%s/", path, id)  # make the AJAX/XHR URL
    value <- content(GET(ajax))$value                                    # get the data

    data.frame(path=path, id=id, value=value, stringsAsFactors=FALSE)    # make a data frame

  }) 

## Source: local data frame [3 x 3]
## 
##           path    id       value
##          (chr) (chr)       (chr)
## 1        phone dX6wf 503 155 744
## 2        skype dX6wf    e.bobruk
## 3 communicator dX6wf     7686136
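To see what the two str_match() calls are extracting, here they are applied to a made-up rel blob of the same shape (the values are placeholders, not real site data, apart from the id taken from the output above):

```r
library(stringr)

# hypothetical blob, shaped like the hidden data in the li's attributes
x <- "rel {'path':'phone','id':'dX6wf'}"

str_match(x, "'path':'([[:alnum:]]+)'")[, 2]  # column 2 = capture group 1
# [1] "phone"
str_match(x, "'id':'([[:alnum:]]+)'")[, 2]
# [1] "dX6wf"
```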

Having done all that, I'm pretty disappointed that the site doesn't have a better Terms of Service/Use. It's fairly obvious they really don't want you scraping this data.

answered Nov 14 '22 by hrbrmstr


Here's a solution using RSelenium (see the RSelenium introduction) and PhantomJS.

However, I'm not sure how usable it is, because it runs very slowly on my machine, and I'm not a PhantomJS or Selenium expert, so I don't yet know where speed improvements can be made. Something to look into.

Edit

I've tried this again and it seems to be ok for speed.

library(RSelenium)
library(rvest)

## Terminal command to start selenium (on ubuntu)
## cd ~/selenium && java -jar selenium-server-standalone-2.48.2.jar
url <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"

RSelenium::startServer()
remDr <- remoteDriver(browserName = "phantomjs")

remDr$open()
remDr$navigate(url)

# css <- ".cpointer:nth-child(1)"  ## couldn't get this to work
xp <- "//div[@class='contactbox-indent rel brkword']"
webElem <- remDr$findElement(using = 'xpath', xp)

# webElem <- remDr$findElement(using = 'css selector', css)
webElem$clickElement()

## the page source now includes the clicked element
page_source <- remDr$getPageSource()[[1]]
pos <- regexpr('class=\\"xx-large', page_source)

## you could write a more intelligent regex, but this works for now
phone_number <- substr(page_source, pos + 11, pos + 21)
phone_number
# "503 155 744"

# remDr$close()
# remDr$closeServer()
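The substr()/regexpr() step is admittedly brittle. Since the clicked element is now in the page source, you could instead parse it with rvest and select the revealed node; a sketch, assuming (as the regex above does) that the number lands in a strong tag with class xx-large:

```r
library(rvest)

# stand-in for remDr$getPageSource()[[1]] after the click; in a live
# session you would pass page_source itself to read_html()
page_source <- '<div><strong class="xx-large">503 155 744</strong></div>'

read_html(page_source) %>%
  html_node("strong.xx-large") %>%
  html_text()
# [1] "503 155 744"
```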
answered Nov 14 '22 by tospig