I want to read HTML files from a web site. Specifically, I want to read books in HTML format from gutenberg.org. The title of each chapter is marked with the tag "h2" and the content of each chapter follows in the paragraph tags "p" after the "h2". Using the package XML I am able to get the values or the full HTML code for each tag.
Here is some sample code using George Eliot's Middlemarch:
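library(XML)
# Parse the page into an internal document so that XPath queries can be run on it
doc.html <- htmlTreeParse('http://www.gutenberg.org/files/145/145-h/145-h.htm',
                          useInternalNodes = TRUE)
# Text content of every h2 and p node
doc.value <- xpathApply(doc.html, '//h2|//p', xmlValue)
# The h2 and p nodes themselves
doc.html.value <- xpathApply(doc.html, '//h2|//p')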
doc.value contains a list where each element is the content of a tag, but I cannot tell whether it came from an h2 tag or a p tag. On the other hand, doc.html.value contains a list with the HTML code for each tag. This tells me whether it is an "h2" or "p" tag, but it also contains a lot of extra code (style information, etc.) that I don't need.
My question: Is there a simple way to obtain only the type of the tag and the value of the tag without the other information associated with it?
The documentation for xmlValue suggests that there is another function, xmlName, which extracts just the name of the tag. Using these two, what you want can be computed:
doc.html.name.value <- xpathApply(doc.html, '//h2|//p', function(x) { list(name=xmlName(x), content=xmlValue(x)); })
> doc.html.name.value[[1]]
$name
[1] "h2"
$content
[1] "\r\nGeorge Eliot\r\n"
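If you would rather have a table than a list of lists, the same two functions can feed a data frame. The following is only a minimal sketch: the inline HTML is a made-up stand-in for the Gutenberg page, the names html, nodes, tags and chapter are my own, and it assumes the nodes come back in document order (which is what I have seen for this kind of query, but treat it as an assumption).

```r
library(XML)

# Made-up stand-in for the Gutenberg page, so the example runs offline
html <- '<html><body>
<h2 style="color: red">Chapter I</h2>
<p class="noindent">First paragraph.</p>
<p>Second paragraph.</p>
<h2>Chapter II</h2>
<p>Third paragraph.</p>
</body></html>'

doc <- htmlParse(html, asText = TRUE)
nodes <- getNodeSet(doc, "//h2|//p")

# One row per node: tag name plus text, with surrounding whitespace stripped
tags <- data.frame(
  name    = sapply(nodes, xmlName),
  content = trimws(sapply(nodes, xmlValue)),
  stringsAsFactors = FALSE
)

# Running count of h2 headings, so each paragraph carries its chapter number
# (assumes document order, as noted above)
tags$chapter <- cumsum(tags$name == "h2")
```

From there, something like split(tags$content[tags$name == "p"], tags$chapter[tags$name == "p"]) should group the paragraphs by chapter, and the style and class attributes never show up because only xmlName and xmlValue are read.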