
Parsing HTML file in R

I want to read HTML files from a web site. Specifically, I want to read books in HTML format from gutenberg.org. The title of each chapter is marked with the tag "h2" and the content of each chapter follows in the paragraph tags "p" after the "h2". Using the package XML I am able to get the values or the full HTML code for each tag.

Here is some sample code using George Eliot's Middlemarch:

library(XML)

doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternalNodes = TRUE)
doc.value <- xpathApply(doc.html, '//h2|//p', xmlValue)
doc.html.value <- xpathApply(doc.html, '//h2|//p')

doc.value contains a list where each element is the content of a tag, but I cannot tell whether it came from an h2 tag or a p tag. On the other hand, doc.html.value contains a list with the HTML code for each tag. This tells me whether it is an "h2" or a "p" tag, but it also contains a lot of extra code (style information, etc.) that I don't need.

My question: Is there a simple way to obtain only the type of the tag and the value of the tag without the other information associated with it?

user2840286 asked Nov 02 '13 23:11



1 Answer

The documentation for xmlValue points to a related function, xmlName, which extracts just the name of the tag. Combining the two inside the function passed to xpathApply gives what you want:

doc.html.name.value <- xpathApply(doc.html, '//h2|//p', function(x) {
  list(name = xmlName(x), content = xmlValue(x))
})

> doc.html.name.value[[1]]
$name
[1] "h2"

$content
[1] "\r\nGeorge Eliot\r\n"
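
Since the goal in the question is to associate each chapter title with the paragraphs that follow it, the name/content pairs can be taken one step further. The following is only a sketch under stated assumptions: it uses a small inline HTML string (made-up content) in place of the Gutenberg page so that it runs without network access, and the variable names tags, titles, paras and chapters are illustrative. It assumes every paragraph belongs to the most recent h2 before it; paragraphs that appear before the first h2 are dropped.

```r
library(XML)

# Small inline document standing in for the Gutenberg page (hypothetical content).
html <- "<html><body>
<h2 style='text-align:center'>Chapter 1</h2>
<p class='noindent'>First paragraph.</p>
<p>Second paragraph.</p>
<h2>Chapter 2</h2>
<p>Third paragraph.</p>
</body></html>"

doc <- htmlParse(html, asText = TRUE)
nodes <- getNodeSet(doc, "//h2|//p")

# One row per node: tag name and whitespace-trimmed text content.
tags <- data.frame(
  name    = sapply(nodes, xmlName),
  content = trimws(sapply(nodes, xmlValue)),
  stringsAsFactors = FALSE
)

# Each h2 starts a new chapter; cumsum gives every row its chapter index
# (0 for anything that precedes the first h2).
tags$chapter <- cumsum(tags$name == "h2")

titles <- tags$content[tags$name == "h2"]
paras  <- tags[tags$name == "p", ]

# Group paragraph texts by chapter. The explicit factor levels keep chapters
# without paragraphs as empty entries and drop rows with chapter index 0.
chapters <- split(paras$content,
                  factor(paras$chapter, levels = seq_along(titles)))
names(chapters) <- titles
```

With this toy input, chapters[["Chapter 1"]] holds the two paragraphs under the first heading. On the real book the same steps should apply to the result of htmlTreeParse, though front matter and headings that are not chapter titles would probably need filtering first.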
musically_ut answered Oct 22 '22 17:10
