Extract Links from Webpage using R

Tags:

r

web-scraping

The two posts below are great examples of different approaches to extracting data from websites and parsing it into R.

Scraping html tables into R data frames using the XML package

How can I use R (Rcurl/XML packages ?!) to scrape this webpage

I am very new to programming, and am just starting out with R, so I am hoping this question is pretty basic, but given those posts above, I imagine that it is.

All I am looking to do is extract links that match a given pattern. I feel like I could probably use RCurl to read in the web pages and extract the links by brute force using regular expressions. That said, if the webpage is fairly well formed, how would I go about doing so using the XML package?

As I learn more, I like to "look" at the data as I work through the problem. The issue is that some of these approaches generate lists of lists of lists, etc., so it is hard for someone who is new (like me) to walk through where I need to go.
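One trick I have picked up for "looking" at nested results is base R's str(), which prints a compact summary of a list of lists. A minimal sketch with a made-up list:

> x <- list(a = list(b = list(1:3, "text")), c = letters[1:2])
> str(x)
List of 2
 $ a:List of 1
  ..$ b:List of 2
  .. ..$ : int [1:3] 1 2 3
  .. ..$ : chr "text"
 $ c: chr [1:2] "a" "b"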

Again, I am very new to all that is programming, so any help or code snippets will be greatly appreciated.

Btibert3 asked Sep 19 '10

People also ask

Can R be used for web scraping?

There are several web scraping tools, and many languages have libraries that support web scraping. Among them, R is a popular choice for web scraping because of its rich library ecosystem, ease of use, and dynamic typing.

What is Rvest?

rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.
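For example, a minimal sketch of that pipeline style (the URL here is just a placeholder):

library(rvest)

# read the page once, then chain simple steps with magrittr's pipe
links <- read_html("http://example.com") %>%
  html_nodes("a") %>%
  html_attr("href")

head(links)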


2 Answers

Even easier with rvest:

library(xml2)
library(rvest)

URL <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"

pg <- read_html(URL)

head(html_attr(html_nodes(pg, "a"), "href"))

## [1] "//stackoverflow.com"
## [2] "http://chat.stackoverflow.com"
## [3] "//stackoverflow.com"
## [4] "http://meta.stackoverflow.com"
## [5] "//careers.stackoverflow.com?utm_source=stackoverflow.com&utm_medium=site-ui&utm_campaign=multicollider"
## [6] "https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=http%3a%2f%2fstackoverflow.com%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
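Since the question asks for links that match a given pattern, one possible follow-up (a sketch; the pattern is just an example) is to filter the extracted vector with base R's grepl():

links <- html_attr(html_nodes(pg, "a"), "href")

# keep only links that mention stackoverflow.com; substitute your own pattern
links[grepl("stackoverflow\\.com", links)]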
hrbrmstr answered Sep 23 '22


The documentation for htmlTreeParse shows one method. Here's another:

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)

(You can drop the "href" attribute from the returned links by passing "links" through "as.vector".)
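For example, a one-line variant of the same call (the names attribute is what gets dropped):

> links <- as.vector(xpathSApply(doc, "//a/@href"))  # same links, "href" names dropped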

My previous reply:

One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> html <- paste(readLines(url), collapse="\n")
> library(stringr)
> matched <- str_match_all(html, "<a href=\"(.*?)\"")

(I guess some people might not approve of using regexp's here.)

matched is a list of matrices, one per input string in the vector html; since html has length one here, matched has just one element. The matches for the first capture group are in column 2 of this matrix (and in general, the i-th group appears in column i + 1).

> links <- matched[[1]][, 2]
> head(links)
[1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
[2] "http://careers.stackoverflow.com"
[3] "http://meta.stackoverflow.com"
[4] "/about"
[5] "/faq"
[6] "/"
David F answered Sep 22 '22