I am trying to scrape phone number from this website: http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53. The phone number can be scraped with rvest
package with selector .\'id_raw\'\::nth-child(1) span+ div strong
(suggested by [selectorGadget] http://selectorgadget.com/).
The problem is that information can be obtained after its mask is clicked. So somehow I have to open a session, provide a click and then scrape information.
EDIT By the way it's not a link imho. Have a look at source. I have a problem because I'm a regular R user, not a javascript programer.
You can grab the data embedded in the <li>
tags that tells the onclick
handler what to do and just get the data directly:
library(httr)
library(rvest)
library(purrr)
library(stringr)
URL <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"
pg <- read_html(URL)
html_nodes(pg, "li.rel") %>% # get the 'special' <li> tags
html_attrs() %>% # extract all the attrs (they're non-standard)
flatten_chr() %>% # list to character vector
keep(~grepl("rel \\{", .x)) %>% # only want ones with 'hidden' secret data
str_extract("(\\{.*\\})") %>% # only get the data
unique() %>% # there are duplicates
map_df(function(x) {
path <- str_match(x, "'path':'([[:alnum:]]+)'")[,2] # extract out the path
id <- str_match(x, "'id':'([[:alnum:]]+)'")[,2] # extract out the id
ajax <- sprintf("http://olx.pl/ajax/misc/contact/%s/%s/", path, id) # make the AJAX/XHR URL
value <- content(GET(ajax))$value # get the data
data.frame(path=path, id=id, value=value, stringsAsFactors=FALSE) # make a data frame
})
## Source: local data frame [3 x 3]
##
## path id value
## (chr) (chr) (chr)
## 1 phone dX6wf 503 155 744
## 2 skype dX6wf e.bobruk
## 3 communicator dX6wf 7686136
Having done all that, I'm pretty disappointed that site doesn't have a better Terms of Service/Use. It's fairly obvious they really don't want you scraping this data.
Here's a solution using RSelenium
, (RSelenium introduction) and phantomjs.
However, I'm not sure how usable it is because it runs very slow on my machine, and I'm not a phantomjs or selenium expert so I don't know where speed improvements can be made yet, so something to look into...
Edit
I've tried this again and it seems to be ok for speed.
library(RSelenium)
library(rvest)
## Terminal command to start selenium (on ubuntu)
## cd ~/selenium && java -jar selenium-server-standalone-2.48.2.jar
url <- "http://olx.pl/oferta/pokoj-1-os-bielany-encyklopedyczna-CID3-IDdX6wf.html#c1c0e14c53"
RSelenium::startServer()
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(url)
# css <- ".cpointer:nth-child(1)" ## couldn't get this to work
xp <- "//div[@class='contactbox-indent rel brkword']"
webElem <- remDr$findElement(using = 'xpath', xp)
# webElem <- remDr$findElement(using = 'css selector', css)
webElem$clickElement()
## the page source now includes the clicked element
page_source <- remDr$getPageSource()[[1]]
pos <- regexpr('class=\\"xx-large', page_source)
## you could write a more intelligent regex, but this works for now
phone_number <- substr(page_source, pos + 11, pos + 21)
phone_number
# "503 155 744"
# remDr$close()
# remDr$closeServer()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With