
Using 'rvest' to extract links


I am trying to scrape data from Yelp. One step is to extract the link for each restaurant. For example, when I search for restaurants in NYC I get some results, and I want to extract the links to all 10 restaurants Yelp recommends on page 1. Here is what I have tried:

library(rvest)     
page=read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name span") %>% html_attr('href')

But the code always returns 'NA'. Can anyone help me with that? Thanks!

asked Feb 06 '16 by Allen


People also ask

How do you scrape with rvest?

In general, web scraping in R (or any other language) boils down to three steps: get the HTML for the web page you want to scrape; decide which part of the page you want to read and work out what HTML/CSS selector you need to select it; then select that HTML and analyze it the way you need.
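For instance, here is a minimal sketch of those three steps against the page from the question (the .biz-name selector is taken from the answers below and assumes Yelp's markup at the time):

library(rvest)

# Step 1: get the HTML for the page
page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")

# Step 2: select the part of the page you want with a CSS selector
nodes <- page %>% html_nodes(".biz-name")

# Step 3: analyze the selected HTML the way you need
html_text(nodes)          # restaurant names
html_attr(nodes, "href")  # restaurant links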

What is the purpose of the rvest package in R?

rvest: Easily Harvest (Scrape) Web Pages. It provides wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML.

What does read_html do in R?

The read_html command creates an R object, basically a list, that stores information about the web page.
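For example, with the page from the question (the class shown is xml2's, which rvest wraps):

page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
class(page)
#> [1] "xml_document" "xml_node"

rvest's html_nodes, html_text and html_attr all query this object.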


2 Answers

library(rvest)

# The href attribute lives on the <a class="biz-name"> element itself,
# so select .biz-name rather than the <span> inside it
page <- read_html("http://www.yelp.com/search?find_loc=New+York,+NY,+USA")
page %>% html_nodes(".biz-name") %>% html_attr('href')
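
If the extracted hrefs come back relative (e.g. "/biz/some-restaurant" — an assumption about Yelp's markup, so check your output), xml2's url_absolute can resolve them into full URLs:

links <- page %>% html_nodes(".biz-name") %>% html_attr("href")
# Resolve relative paths against the site root (xml2 is installed as an rvest dependency)
links <- xml2::url_absolute(links, "http://www.yelp.com")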

Hope this solves your problem.

answered Sep 17 '22 by Bharath


The results from above were quite noisy for me, so I cleaned them up. Starting from every link on the page:

links <- page %>% html_nodes("a") %>% html_attr("href")

a simple regex string match keeps only the relevant ones:

links <- links[which(regexpr('common-url-element', links) >= 1)]
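
The same filter can be written more idiomatically with grepl. Here 'common-url-element' stays a placeholder for whatever substring the target links share ("/biz/" would be my guess for Yelp listings, but check the actual hrefs first):

# fixed = TRUE treats the pattern as a literal substring rather than a regex
links <- links[grepl("common-url-element", links, fixed = TRUE)]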

answered Sep 19 '22 by oliver