I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. How can I do it?
In general, web scraping in R (or in any other language) boils down to the following three steps:
1. Get the HTML for the web page that you want to scrape.
2. Decide what part of the page you want to read and find out what HTML/CSS you need to select it.
3. Select the HTML and analyze it in the way you need.
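The three steps above can be sketched roughly as follows with the XML package. This is a minimal illustration, not a definitive recipe: the inline HTML string, the "story" class, and the XPath expression are made up for the example so that it runs without a network connection. With a real page, step 1 would be a download instead.

```r
library(XML)

# Step 1: get the HTML. A small inline document stands in for a downloaded
# page here (an assumption, so the sketch runs offline).
html <- '<html><body>
  <h1>Headlines</h1>
  <ul>
    <li class="story"><a href="/a">First story</a></li>
    <li class="story"><a href="/b">Second story</a></li>
  </ul>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Step 2: decide what to select. Here: links inside list items whose class
# is "story" (a hypothetical class name used only in this example).
xpath <- "//li[@class='story']/a"

# Step 3: select the nodes and pull out the pieces you need.
titles <- xpathSApply(doc, xpath, xmlValue)
links  <- xpathSApply(doc, xpath, xmlGetAttr, "href")
stories <- data.frame(title = titles, link = links, stringsAsFactors = FALSE)
print(stories)
```

Finding the right selector for step 2 usually means inspecting the page source in a browser first; the XPath above only fits the toy document.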
I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to use regular expressions on HTML, so you will definitely want to parse this with the XML package.
Here's an example to get you started:
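require(RCurl)
require(XML)
# download the page as a single string
webpage <- getURL("http://www.haaretz.com/")
# split the string into a vector of lines
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# parse the HTML; the empty error handler silences parser warnings on malformed markup
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)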
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]
This results in a character vector of mostly just webpage text (along with some javascript):
> head(x)
[1] "Subscribe to Print Edition" "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time: 16:48 (EST+7)"
[4] "  Make Haaretz your homepage" "/*check the search form*/" "function chkSearch()"
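If the leftover JavaScript is unwanted, one possible variation (not part of the answer above) is to select only text nodes that do not sit inside script or style elements. The sketch below assumes a small made-up HTML snippet instead of the real page, and uses base R's trimws() for the whitespace cleanup.

```r
library(XML)

# A toy page (hypothetical content) with a script embedded in a table cell.
html <- '<html><body>
  <table><tr><td>Subscribe to Print Edition</td>
  <td><script>function chkSearch() { return true; }</script></td></tr></table>
  <p>  Make this your homepage  </p>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Select text nodes that are not descendants of <script> or <style>.
txt <- xpathSApply(
  doc,
  "//body//text()[not(ancestor::script) and not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace and drop strings that end up empty.
txt <- trimws(txt)
txt <- txt[nzchar(txt)]
print(txt)
```

Filtering in the XPath expression keeps the script source out of the result from the start, so less cleanup is needed afterwards.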
Your best bet may be the XML package -- see for example this previous question.