Parsing HTML file in R

Question

I want to read HTML files from a web site. Specifically, I want to read books in HTML format from gutenberg.org. The title of each chapter is marked with the tag "h2" and the content of each chapter follows in the paragraph tags "p" after the "h2". Using the package XML I am able to get the values or the full HTML code for each tag.

Here is some sample code using George Eliot's Middlemarch:

library(XML)

doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternalNodes = TRUE)
doc.value <- xpathApply(doc.html, '//h2|//p', xmlValue)
doc.html.value <- xpathApply(doc.html, '//h2|//p')

doc.value contains a list where each element is the content of a tag, but I cannot tell whether it came from an h2 tag or a p tag. On the other hand, doc.html.value contains a list with the HTML code for each tag. This tells me whether it is an "h2" or "p" tag, but it also contains a lot of extra code (style information, etc.) that I don't need.

My question: Is there a simple way to obtain only the type of the tag and the value of the tag without the other information associated with it?

musically_ut · Accepted Answer

The documentation for xmlValue points to a companion function, xmlName, which extracts just the name of the tag. Using these two together, what you want can be computed:

doc.html.name.value <- xpathApply(doc.html, '//h2|//p', function(x) {
  list(name = xmlName(x), content = xmlValue(x))
})

> doc.html.name.value[[1]]
$name
[1] "h2"

$content
[1] "
George Eliot
"
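If a flat table is easier to work with than a list of lists, the same two functions can fill a data frame. The following is only a minimal sketch: the inline HTML string is a made-up stand-in for a Gutenberg page (so it runs without a network connection), and the names html, doc, nodes, tags and the chapter column are hypothetical, not part of the original question. It also assumes, as described in the question, that every h2 starts a new chapter.

```r
library(XML)

# Hypothetical stand-in for a Gutenberg page, so the sketch runs offline.
html <- '<html><body>
  <h2 style="text-align:center">CHAPTER I.</h2>
  <p class="noindent">First paragraph.</p>
  <p>Second paragraph.</p>
  <h2>CHAPTER II.</h2>
  <p>Third paragraph.</p>
</body></html>'

# htmlParse() returns internal nodes, like htmlTreeParse(useInternalNodes = TRUE).
doc <- htmlParse(html, asText = TRUE)
nodes <- getNodeSet(doc, "//h2|//p")

# One row per node: tag name plus whitespace-trimmed text, no attributes.
tags <- data.frame(
  name    = sapply(nodes, xmlName),
  content = trimws(sapply(nodes, xmlValue)),
  stringsAsFactors = FALSE
)

# Assumption: each h2 opens a new chapter, so a running count of h2 rows
# labels every paragraph with the chapter it belongs to.
tags$chapter <- cumsum(tags$name == "h2")

print(tags)
```

From there, something like split(tags$content[tags$name == "p"], tags$chapter[tags$name == "p"]) should give the paragraphs grouped by chapter. trimws() also removes the leading and trailing newlines visible in the output above.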

Tags: html, r, xml, html-parsing

user2840286
