Is there a simple way in R to extract only the text elements of an HTML page?


I think this is known as 'screen scraping', but I have no experience of it; I just need a simple way of extracting the text you'd normally see in a browser when visiting a URL.



1 Answer

I had to do this once upon a time myself.

One way of doing it is to make use of XPath expressions. You will need these packages installed from the repository at http://www.omegahat.org/
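If you don't have them yet, here's a rough install sketch (my assumption is that the Omegahat repository serves RTidyHTML as a source package under the /R path; RCurl and XML are also on CRAN):

install.packages("RTidyHTML", repos = "http://www.omegahat.org/R", type = "source")
install.packages(c("RCurl", "XML"))  # these two are on CRAN as well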

library(RCurl)
library(RTidyHTML)
library(XML)

We use RCurl to connect to the website of interest. It is an R interface to the libcurl library, and its many options let you access websites that the default functions in base R would struggle with.
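For example, here is a quick sketch of passing extra options through getURL (followlocation and useragent are libcurl option names that RCurl forwards, but treat the exact spellings here as my assumption):

u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
doc.raw <- getURL(u, followlocation = TRUE, useragent = "Mozilla/5.0 (compatible; R)")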

We use RTidyHTML to clean up malformed HTML web pages so that they are easier to parse. It is an R-interface to the libtidy library.

We use XML to parse the HTML code with our XPath expressions. It is an R-interface to the libxml2 library.

Anyways, here's what you do (minimal code, but options are available, see help pages of corresponding functions):

u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
doc.raw <- getURL(u)                            # fetch the raw HTML
doc <- tidyHTML(doc.raw)                        # clean up malformed HTML
html <- htmlTreeParse(doc, useInternal = TRUE)  # parse into a DOM tree
# grab every text node in the body except those inside script/style/noscript
txt <- xpathApply(html,
  "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]",
  xmlValue)
cat(unlist(txt))

There may be some problems with this approach, but I can't remember what they are off the top of my head. I don't think my XPath expression works with all web pages: sometimes it might not filter out script code, and it may simply not work at all on some other pages, so it's best to experiment!
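If leftover whitespace or empty text nodes are the problem, a small cleanup pass over the txt list from above can help (just a sketch):

txt.vec <- unlist(txt)
txt.vec <- gsub("^[[:space:]]+|[[:space:]]+$", "", txt.vec)  # trim each text node
txt.vec <- txt.vec[txt.vec != ""]                            # drop nodes that were pure whitespace
cat(txt.vec, sep = "\n")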

P.S. Another way, which I think works almost perfectly for scraping all the text from HTML, is the following (basically getting Internet Explorer to do the conversion for you):

library(RDCOMClient)

u <- "http://stackoverflow.com/questions/tagged?tagnames=r"
ie <- COMCreate("InternetExplorer.Application")         # start an IE instance via COM
ie$Navigate(u)                                          # load the page
txt <- list()
txt[[u]] <- ie[["document"]][["body"]][["innerText"]]   # the text as IE renders it
ie$Quit()
print(txt)

HOWEVER, I've never liked doing this, because not only is it slow, but if you vectorise it and apply it over a vector of URLs and Internet Explorer crashes on a bad page, then R might hang or crash itself (I don't think ?try helps much in this case). It's also prone to allowing pop-ups. I don't know, it's been a while since I've done this, but I thought I should point this out.
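If you do go this route anyway, wrapping each URL in tryCatch at least turns a bad page into an NA instead of killing the whole loop. This is only a sketch (untested; the Sys.sleep is a crude guess at a page-load wait, and get.ie.text is just a name I made up):

library(RDCOMClient)

get.ie.text <- function(u) {
  ie <- COMCreate("InternetExplorer.Application")
  on.exit(try(ie$Quit(), silent = TRUE), add = TRUE)  # always try to close IE, even on error
  tryCatch({
    ie$Navigate(u)
    Sys.sleep(2)                                      # crude wait for the page to load
    ie[["document"]][["body"]][["innerText"]]
  }, error = function(e) NA_character_)
}

urls <- c("http://stackoverflow.com/questions/tagged?tagnames=r")
txt <- lapply(urls, get.ie.text)
names(txt) <- urls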
