Extract Links from Webpage using R

Tags:

r

web-scraping

The two posts below are great examples of different approaches to extracting data from websites and parsing it into R.

Scraping html tables into R data frames using the XML package

How can I use R (Rcurl/XML packages ?!) to scrape this webpage

I am very new to programming, and am just starting out with R, so I am hoping this question is pretty basic, but given those posts above, I imagine that it is.

All I am looking to do is extract links that match a given pattern. I feel like I could probably use RCurl to read in the web pages and extract the links by brute force using regular expressions. That said, if the webpage is fairly well formed, how would I go about doing so using the XML package?

As I learn more, I like to "look" at the data as I work through the problem. The issue is that some of these approaches generate lists of lists of lists, etc., so it is hard for someone who is new (like me) to walk through where I need to go.
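One trick I have picked up for "looking" at nested results is base R's str(), which prints a compact summary of a list of lists. A minimal sketch with a made-up list:

> x <- list(a = list(b = list(1:3, "text")), c = letters[1:2])
> str(x)
List of 2
 $ a:List of 1
  ..$ b:List of 2
  .. ..$ : int [1:3] 1 2 3
  .. ..$ : chr "text"
 $ c: chr [1:2] "a" "b"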

Again, I am very new to all that is programming, so any help or code snippets will be greatly appreciated.

Btibert3 asked Sep 19 '10

People also ask

Can R be used for web scraping?

There are several web scraping tools, and many languages have libraries that support web scraping. Among them, R is a popular choice for web scraping because of its rich library ecosystem, ease of use, and dynamic typing.

What is Rvest?

rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.
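For example, a minimal sketch of that pipeline style (the URL here is just a placeholder):

library(rvest)

# read the page once, then chain simple steps with magrittr's pipe
links <- read_html("http://example.com") %>%
  html_nodes("a") %>%
  html_attr("href")

head(links)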


2 Answers

Even easier with rvest:

library(xml2)
library(rvest)

URL <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"

pg <- read_html(URL)

head(html_attr(html_nodes(pg, "a"), "href"))

## [1] "//stackoverflow.com"
## [2] "http://chat.stackoverflow.com"
## [3] "//stackoverflow.com"
## [4] "http://meta.stackoverflow.com"
## [5] "//careers.stackoverflow.com?utm_source=stackoverflow.com&utm_medium=site-ui&utm_campaign=multicollider"
## [6] "https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=http%3a%2f%2fstackoverflow.com%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
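Since the question asks for links that match a given pattern, one possible follow-up (a sketch; the pattern is just an example) is to filter the extracted vector with base R's grepl():

links <- html_attr(html_nodes(pg, "a"), "href")

# keep only links that mention stackoverflow.com; substitute your own pattern
links[grepl("stackoverflow\\.com", links)]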
hrbrmstr answered Sep 23 '22


The documentation for htmlTreeParse shows one method. Here's another:

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)

(You can drop the "href" attribute from the returned links by passing "links" through "as.vector".)
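For example, a one-line variant of the same call (the names attribute is what gets dropped):

> links <- as.vector(xpathSApply(doc, "//a/@href"))  # same links, "href" names dropped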

My previous reply:

One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> html <- paste(readLines(url), collapse="\n")
> library(stringr)
> matched <- str_match_all(html, "<a href=\"(.*?)\"")

(I guess some people might not approve of using regexp's here.)

matched is a list of matrices, one per input string in the vector html; since html has length one here, matched has just one element. The matches for the first capture group are in column 2 of this matrix (and in general, the i-th group appears in column i + 1).

> links <- matched[[1]][, 2]
> head(links)
[1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
[2] "http://careers.stackoverflow.com"
[3] "http://meta.stackoverflow.com"
[4] "/about"
[5] "/faq"
[6] "/"
David F answered Sep 22 '22