 

Scraping a dynamic ecommerce page with infinite scroll


I'm using rvest in R to do some scraping. I know some HTML and CSS.

I want to get the price of every product at this URL:

http://www.linio.com.co/tecnologia/celulares-telefonia-gps/

New items load as you scroll down the page.

What I've done so far:

```r
library(rvest)

Linio_Celulares <- html("http://www.linio.com.co/celulares-telefonia-gps/")

Linio_Celulares %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()
```

And I get what I need, but only for the first 25 elements (the ones loaded by default).

```
 [1] "$ 1.999.900" "$ 1.999.900" "$ 1.999.900" "$ 2.299.900" "$ 2.279.900"
 [6] "$ 2.279.900" "$ 1.159.900" "$ 1.749.900" "$ 1.879.900" "$ 189.900"
[11] "$ 2.299.900" "$ 2.499.900" "$ 2.499.900" "$ 2.799.000" "$ 529.900"
[16] "$ 2.699.900" "$ 2.149.900" "$ 189.900"   "$ 2.549.900" "$ 1.395.900"
[21] "$ 249.900"   "$ 41.900"    "$ 319.900"   "$ 149.900"
```
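As a side note, if you want the scraped strings as numbers, here is a small helper (my own sketch, not from the original post). Colombian prices use "." as the thousands separator, so stripping "$", spaces and dots before converting should be enough:

```r
# Sketch: convert scraped price strings like "$ 1.999.900" to numeric.
# Assumes "." is only ever a thousands separator in these prices.
parse_price <- function(x) {
  as.numeric(gsub("[$. ]", "", x))
}

parse_price(c("$ 1.999.900", "$ 41.900"))
# returns 1999900 41900
```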

Question: How to get all the elements of this dynamic section?

I guess I could scroll the page until all elements are loaded and then use html(URL). But that seems like a lot of work (I'm planning to do this on several sections). There should be a programmatic workaround.

Omar Gonzales asked Apr 25 '15



1 Answer

As @nrussell suggested, you can use RSelenium to programmatically scroll down the page before getting the source code.

You could for example do:

```r
library(RSelenium)
library(rvest)

# start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

# navigate to your page
remDr$navigate("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/")

# scroll down 5 times, waiting for the page to load each time
for (i in 1:5) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}

# get the page html
page_source <- remDr$getPageSource()

# parse it
html(page_source[[1]]) %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()
```
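As an alternative worth checking first: infinite-scroll pages often fetch each batch of products from a paginated URL behind the scenes. If you can spot that URL in your browser's network tab, plain rvest is enough and no browser automation is needed. The `?page=` parameter below is an assumption for illustration, not something confirmed for this site:

```r
library(rvest)

# Build the URL for page p. The "?page=" parameter is hypothetical --
# inspect the site's network requests to find the real one.
page_url <- function(base_url, p) {
  paste0(base_url, "?page=", p)
}

# Sketch: fetch prices from the first n pages, assuming the same
# ".product-itm-price-new" selector works on each paginated response.
scrape_all_prices <- function(base_url, n_pages) {
  unlist(lapply(1:n_pages, function(p) {
    html(page_url(base_url, p)) %>%
      html_nodes(".product-itm-price-new") %>%
      html_text()
  }))
}

# prices <- scrape_all_prices("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/", 5)
```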
NicE answered Sep 20 '22