Is there a simple way in R to extract only the text elements of an HTML page?
I think this is known as 'screen scraping', but I have no experience of it. I just need a simple way of extracting the text you'd normally see in a browser when visiting a URL.
Click and drag to select the text on the Web page you want to extract and press “Ctrl-C” to copy the text. Open a text editor or document program and press “Ctrl-V” to paste the text from the Web page into the text file or document window. Save the text file or document to your computer.
I had to do this once upon a time myself.
One way of doing it is to make use of XPath expressions. You will need these packages installed from the repository at http://www.omegahat.org/
    library(RCurl)
    library(RTidyHTML)
    library(XML)

We use RCurl to connect to the website of interest. I think it's fair to say it has lots of options that let you access websites the default functions in base R would have difficulty with. It is an R-interface to the libcurl library.
We use RTidyHTML to clean up malformed HTML web pages so that they are easier to parse. It is an R-interface to the libtidy library.
We use XML to parse the HTML code with our XPath expressions. It is an R-interface to the libxml2 library.
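As a rough illustration of the RCurl options mentioned above, here is a minimal sketch that passes a few libcurl options through `getURL`. The wrapper name `fetch_page` and the specific option values are assumptions made for this example, not something from the original answer; which options you actually need depends on the site.

```r
library(RCurl)

# fetch_page() is a hypothetical wrapper, not part of RCurl.
fetch_page <- function(u) {
  getURL(
    u,
    followlocation = TRUE,    # follow HTTP redirects
    useragent = "R (RCurl)",  # some sites reject requests without a user agent
    timeout = 30              # give up after 30 seconds
  )
}

# Example call (needs network access):
# doc.raw <- fetch_page("http://stackoverflow.com/questions/tagged?tagnames=r")
```

`listCurlOptions()` in RCurl prints the full set of option names that can be passed this way.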
Anyways, here's what you do (minimal code, but options are available, see help pages of corresponding functions):
    u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
    doc.raw <- getURL(u)      # download the raw HTML
    doc <- tidyHTML(doc.raw)  # clean up malformed markup
    html <- htmlTreeParse(doc, useInternalNodes = TRUE)
    # keep text nodes under <body>, skipping script, style and noscript content
    txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
    cat(unlist(txt))
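To see what the XPath filter does without touching the network, here is a small sketch that runs the same expression on an inline HTML string using only the XML package. The page content is made up for the example, and the whitespace clean-up at the end is an addition that is not in the code above.

```r
library(XML)

# Made-up page so the example runs offline
page <- "<html><head><style>p { color: red; }</style></head>
<body><p>Hello</p><script>var x = 1;</script><p>world</p>
<noscript>Enable JavaScript</noscript></body></html>"

doc <- htmlParse(page, asText = TRUE)

# Same expression as above: text nodes under <body> that are not inside
# script, style or noscript elements
xp <- "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]"
txt <- xpathSApply(doc, xp, xmlValue)

# Drop whitespace-only text nodes (extra step, not in the original code)
txt <- trimws(txt)
txt <- txt[nzchar(txt)]
print(txt)
```

Only the text of the two paragraphs should survive; the script and noscript content is filtered out by the `ancestor::` predicates.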
There may be some problems with this approach, but I can't remember what they are off the top of my head. I don't think my XPath expression works with all web pages: sometimes it might not filter out script code, and it may just plain not work with some pages at all. Best to experiment!
P.S. Another way, which I think works almost perfectly for scraping all the text from an HTML page, is the following (basically getting Internet Explorer to do the conversion for you):

    library(RDCOMClient)
    u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
    ie <- COMCreate("InternetExplorer.Application")
    ie$Navigate(u)
    txt <- list()
    txt[[u]] <- ie[["document"]][["body"]][["innerText"]]
    ie$Quit()
    print(txt)

HOWEVER, I've never liked doing this. Not only is it slow, but if you apply it to a vector of URLs and Internet Explorer crashes on a bad page, then R might hang or crash itself (I don't think ?try helps that much in this case). It's also prone to allowing pop-ups. It's been a while since I've done this, but I thought I should point it out.
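For the XPath approach, running over a vector of URLs can at least be guarded against R-level errors. The sketch below is one way to do that with `tryCatch`; `scrape_all` and `scrape_one` are hypothetical names introduced for this example. Note that this only catches errors raised inside R, so it would not rescue a session where an external program such as Internet Explorer hangs.

```r
# scrape_all() is a hypothetical helper, not part of any package.
# scrape_one is any function that takes one URL and returns its text.
scrape_all <- function(urls, scrape_one) {
  out <- lapply(urls, function(u) {
    tryCatch(
      scrape_one(u),
      error = function(e) {
        # Report the failure and carry on with the next URL
        warning("Failed on ", u, ": ", conditionMessage(e), call. = FALSE)
        NA_character_
      }
    )
  })
  names(out) <- urls
  out
}

# Usage with a stand-in scraper that fails on one input
fake_scraper <- function(u) {
  if (u == "bad") stop("could not parse page")
  paste("text of", u)
}
res <- suppressWarnings(scrape_all(c("a", "bad"), fake_scraper))
```

Failed pages come back as `NA`, so they are easy to find and retry afterwards.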