Is there a simple way in R to extract only the text elements of an HTML page?


I think this is known as 'screen scraping', but I have no experience of it; I just need a simple way of extracting the text you'd normally see in a browser when visiting a URL.



1 Answer

I had to do this once upon a time myself.

One way of doing it is to make use of XPath expressions. You will need these packages installed from the repository at http://www.omegahat.org/
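If you don't have them yet, here's a rough install sketch (my assumption is that the Omegahat repository serves RTidyHTML as a source package under the /R path; RCurl and XML are also on CRAN):

install.packages("RTidyHTML", repos = "http://www.omegahat.org/R", type = "source")
install.packages(c("RCurl", "XML"))  # these two are on CRAN as well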

library(RCurl)
library(RTidyHTML)
library(XML)

We use RCurl to connect to the website of interest. It is an R interface to the libcurl library, and its many options let you access websites that the default functions in base R would struggle with.
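For example, here is a quick sketch of passing extra options through getURL (followlocation and useragent are libcurl option names that RCurl forwards, but treat the exact spellings here as my assumption):

u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
doc.raw <- getURL(u, followlocation = TRUE, useragent = "Mozilla/5.0 (compatible; R)")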

We use RTidyHTML to clean up malformed HTML web pages so that they are easier to parse. It is an R-interface to the libtidy library.

We use XML to parse the HTML code with our XPath expressions. It is an R-interface to the libxml2 library.

Anyways, here's what you do (minimal code, but options are available, see help pages of corresponding functions):

u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
doc.raw <- getURL(u)                            # fetch the raw HTML
doc <- tidyHTML(doc.raw)                        # clean up malformed HTML
html <- htmlTreeParse(doc, useInternal = TRUE)  # parse into a DOM tree
# grab every text node in the body except those inside script/style/noscript
txt <- xpathApply(html,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
  xmlValue)
cat(unlist(txt))

There may be some problems with this approach, but I can't remember what they are off the top of my head. I don't think my XPath expression works with all web pages: sometimes it might not filter out script code, and it may simply not work at all on some other pages, so it's best to experiment!
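If leftover whitespace or empty text nodes are the problem, a small cleanup pass over the txt list from above can help (just a sketch):

txt.vec <- unlist(txt)
txt.vec <- gsub("^[[:space:]]+|[[:space:]]+$", "", txt.vec)  # trim each text node
txt.vec <- txt.vec[txt.vec != ""]                            # drop nodes that were pure whitespace
cat(txt.vec, sep = "\n")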

P.S. Another way, which I think works almost perfectly for scraping all the text from HTML, is the following (basically getting Internet Explorer to do the conversion for you):

library(RDCOMClient)

u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
ie <- COMCreate("InternetExplorer.Application")         # start an IE instance via COM
ie$Navigate(u)                                          # load the page
txt <- list()
txt[[u]] <- ie[["document"]][["body"]][["innerText"]]   # the text as IE renders it
ie$Quit()
print(txt)

HOWEVER, I've never liked doing this, because not only is it slow, but if you vectorise it and apply it over a vector of URLs and Internet Explorer crashes on a bad page, then R might hang or crash itself (I don't think ?try helps much in this case). It's also prone to allowing pop-ups. I don't know, it's been a while since I've done this, but I thought I should point this out.
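If you do go this route anyway, wrapping each URL in tryCatch at least turns a bad page into an NA instead of killing the whole loop. This is only a sketch (untested; the Sys.sleep is a crude guess at a page-load wait, and get.ie.text is just a name I made up):

library(RDCOMClient)

get.ie.text <- function(u) {
  ie <- COMCreate("InternetExplorer.Application")
  on.exit(try(ie$Quit(), silent = TRUE), add = TRUE)  # always try to close IE, even on error
  tryCatch({
    ie$Navigate(u)
    Sys.sleep(2)                                      # crude wait for the page to load
    ie[["document"]][["body"]][["innerText"]]
  }, error = function(e) NA_character_)
}

urls <- c("http://stackoverflow.com/questions/tagged?tagnames=r")
txt <- lapply(urls, get.ie.text)
names(txt) <- urls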
