In package XML, in the examples section of ?htmlParse, is the following function getLinks().
getLinks <- function() {
  links <- character()
  list(a = function(node, ...) {
         links <<- c(links, xmlGetAttr(node, "href"))
         node
       },
       links = function() links)
}
After using it, and then staring at it for a while, I still cannot wrap my head around the sequence of events occurring in the body of the function.
> bod <- as.list(body(getLinks))
> c(bod, rapply(bod, as.list))
[[1]]
`{`
[[2]]
links <- character()
[[3]]
list(a = function(node, ...) {
links <<- c(links, xmlGetAttr(node, "href"))
node
}, links = function() links)
$a
function (node, ...)
{
links <<- c(links, xmlGetAttr(node, "href"))
node
}
<environment: 0x595f7f0>
$links
function ()
links
<environment: 0x595f7f0>
Can someone provide a detailed explanation of the chain of events that occur in this function?
For a sample, run the following code:
> library(XML)
> URL <- "http://www.retrosheet.org/game.htm"
> h1 <- getLinks()
> htmlTreeParse(URL, handlers = h1)
> h1$links()
On its own the function really doesn't do much. It's really only useful in the context of htmlTreeParse. It does two things. First, it creates an enclosing environment (a closure) where a vector of links will be collected. Second, it returns a list which can be passed as the handlers= argument of htmlTreeParse. According to the documentation, handlers is an
Optional collection of functions used to map the different XML nodes to R objects. Typically, this is a named list of functions, and a closure can be used to provide local data. This provides a way of filtering the tree as it is being created in R, adding or removing nodes, and generally processing them as they are constructed in the C code.
So htmlTreeParse will look in the list for names that match the node names of the elements in the HTML file. Since the list has an "a" element, that function will be called for each <a> (link) tag in the document. The function simply extracts the href attribute, which is where the URL is stored, and appends it to the links vector in the closure's environment.
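The dispatch-by-name idea can be imitated in base R without the XML package. The sketch below is only an illustration and not how htmlTreeParse is implemented: walk_nodes(), make_handlers() and the list-based toy nodes are hypothetical stand-ins for the parser, getLinks() and real XML nodes.

```r
# Toy handler set modelled on getLinks(); nodes here are plain lists,
# not real XML nodes (an assumption made for this sketch).
make_handlers <- function() {
  links <- character()
  list(
    a = function(node, ...) {
      # `<<-` assigns to `links` in the enclosing environment
      links <<- c(links, node$attrs[["href"]])
      node
    },
    links = function() links
  )
}

# Hypothetical stand-in for the parser: for each node, call the handler
# whose name matches the node's tag name, if there is one.
walk_nodes <- function(nodes, handlers) {
  for (node in nodes) {
    handler <- handlers[[node$name]]
    if (is.function(handler)) handler(node)
  }
  invisible(handlers)
}

nodes <- list(
  list(name = "p", attrs = list()),
  list(name = "a", attrs = list(href = "http://example.com/one")),
  list(name = "a", attrs = list(href = "/two"))
)

h <- make_handlers()
walk_nodes(nodes, h)
result <- h$links()
print(result)
```

The "p" node is skipped because the list has no element named "p"; only the two "a" nodes reach the handler, so result holds the two href values in document order.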
Finally, after parsing is done, you need a way to access that links vector inside the closure. So the list also defines a "links" element. This is a function that just returns the protected vector. You could have called this function anything you like, so long as it didn't match the name of a tag in the HTML document.
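That both functions in the list see the same links vector can be checked directly. This is a minimal sketch using a simplified collector (make_collector is a hypothetical name) that stores its argument as-is, so it runs without the XML package.

```r
# Simplified stand-in for getLinks(): stores the argument itself
make_collector <- function() {
  links <- character()
  list(
    a = function(node, ...) {
      links <<- c(links, node)
      node
    },
    links = function() links
  )
}

h <- make_collector()
invisible(h$a("first"))  # invisible() only suppresses auto-printing

# Both closures were created in the same call, so they share one environment
same_env <- identical(environment(h$a), environment(h$links))

# The vector lives in that shared environment
stored <- get("links", envir = environment(h$a))

print(same_env)
print(stored)
```

Because the vector is reachable only through that environment, the links() accessor is the convenient way to read it after parsing.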
So this getLinks() function just returns a list which can be used as a set of handlers. Most of the real work is done in the htmlTreeParse function.
In addition to the excellent explanation by Mr Flick, here is a simple demonstration of how this function works:
getLinks <- function() {
  links <- character()
  list(a = function(node, ...) {
         links <<- c(links, node)  ## I omit the call to xmlGetAttr
         node
       },
       links = function() links)
}
h1 = getLinks()
Now I just call the handler several times and print the collected links before each call:
for (i in 1:3) {
  print(h1$links())
  h1$a(paste0("node", i))
}
As you can see, links is just a character vector that grows at each call to h1$a() by the newly found link:
character(0)
[1] "node1"
[1] "node1" "node2"
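Each call to getLinks() runs the function body again and so creates a fresh links vector. A brief check, reusing the simplified getLinks() from this answer (h2 is a name introduced here only for illustration):

```r
getLinks <- function() {
  links <- character()
  list(a = function(node, ...) {
         links <<- c(links, node)  # simplified: store the argument itself
         node
       },
       links = function() links)
}

h1 <- getLinks()
h2 <- getLinks()

# invisible() only suppresses auto-printing of the returned node
invisible(h1$a("node1"))
invisible(h1$a("node2"))
invisible(h2$a("other"))

print(h1$links())
print(h2$links())
```

The two handler lists do not interfere with each other, so a new getLinks() result should be created for each document whose links are to be collected separately.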