
htmlTreeParse handler explanation

Tags:

r

The examples section of ?htmlParse in the XML package includes the following function, getLinks().

getLinks <- function() { 
       links <- character() 
       list(a = function(node, ...) { 
                   links <<- c(links, xmlGetAttr(node, "href"))
                   node 
                }, 
            links = function()links)
     }

After using it, and then staring at it for a while, I still cannot wrap my head around the sequence of events occurring in the body of the function.

> bod <- as.list(body(getLinks))
> c(bod, rapply(bod, as.list))
[[1]]
`{`

[[2]]
links <- character()

[[3]]
list(a = function(node, ...) {
    links <<- c(links, xmlGetAttr(node, "href"))
    node
}, links = function() links)

$a
function (node, ...) 
{
    links <<- c(links, xmlGetAttr(node, "href"))
    node
}
<environment: 0x595f7f0>

$links
function () 
links
<environment: 0x595f7f0>

Can someone provide a detailed explanation of the chain of events that occur in this function?

For a sample, run the following code:

> library(XML)
> URL <- "http://www.retrosheet.org/game.htm"
> h1 <- getLinks()
> htmlTreeParse(URL, handlers = h1)
> h1$links()
Rich Scriven asked Dec 25 '22 05:12


2 Answers

On its own, the function really doesn't do much; it's only useful in the context of htmlTreeParse. It does two things. First, it creates a closure: an enclosing environment in which a vector of links is collected. Second, it returns a list that can be passed as the handlers= argument of htmlTreeParse. According to the documentation, handlers is an

Optional collection of functions used to map the different XML nodes to R objects. Typically, this is a named list of functions, and a closure can be used to provide local data. This provides a way of filtering the tree as it is being created in R, adding or removing nodes, and generally processing them as they are constructed in the C code.

htmlTreeParse looks in the list for names that match the node names of the elements in the HTML file. Since the list has an "a" element, that function is called for each <a> (link) tag in the document. The function simply extracts the href attribute, which is where the URL is stored, and appends it to the links vector in the closure's environment.
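The name-based dispatch described above can be imitated in base R. This is only a sketch: make_node() and walk() are hypothetical stand-ins invented for illustration, not the XML package's code, and collect_hrefs() mirrors getLinks() but reads the href from a plain list instead of calling xmlGetAttr().

```r
# Hypothetical stand-in for a parsed node: a tag name, attributes, children.
make_node <- function(name, attrs = list(), children = list()) {
  list(name = name, attrs = attrs, children = children)
}

# Hypothetical dispatcher: visit the children first, then look up a handler
# whose list name equals the node's tag name and call it with the node.
walk <- function(node, handlers) {
  for (child in node$children) walk(child, handlers)
  h <- handlers[[node$name]]      # NULL when no element has that name
  if (is.function(h)) h(node)
  invisible(NULL)
}

# Same shape as getLinks(): a handler named "a" plus an accessor.
collect_hrefs <- function() {
  links <- character()
  list(a = function(node, ...) {
         links <<- c(links, node$attrs$href)
         node
       },
       links = function() links)
}

doc <- make_node("body", children = list(
  make_node("p", children = list(make_node("a", list(href = "one.html")))),
  make_node("a", list(href = "two.html"))
))

h <- collect_hrefs()
walk(doc, h)
found <- h$links()
print(found)
```

Only the two "a" nodes trigger a handler call; "body" and "p" have no matching list element, so they are skipped.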

Finally, after parsing is done, you need a way to be able to access that links vector inside the closure. So the list also defines a "links" element. This is a function that just returns the protected vector. You could have called this function anything you like so long as it didn't match the name of a tag in the HTML document.
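To see that the handler and the accessor really share one environment, here is a minimal base R sketch. make_collector() is a hypothetical copy of getLinks() that stores the value it is given, so the XML package is not needed.

```r
# Hypothetical copy of getLinks() that appends whatever it is passed.
make_collector <- function() {
  links <- character()
  list(a = function(node, ...) {
         links <<- c(links, node)
         node
       },
       links = function() links)
}

h1 <- make_collector()
h2 <- make_collector()

# Both functions in one list were created during the same call,
# so they share a single enclosing environment.
same_env <- identical(environment(h1$a), environment(h1$links))

# A second call to make_collector() creates a fresh, independent environment.
separate_env <- !identical(environment(h1$a), environment(h2$a))

h1$a("x.html")

via_accessor <- h1$links()                                 # the intended route
via_env <- get("links", envir = environment(h1$a))         # same vector, read directly
untouched <- h2$links()                                    # h2 was never called

print(list(same_env, separate_env, via_accessor, via_env, untouched))
```

The accessor is a convenience: it saves the caller from reaching into environment(h1$a) by hand.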

So this getLinks() function just returns a list which can be used as a handler. Most of the real work is done in the htmlTreeParse function.

MrFlick answered Jan 08 '23 18:01


In addition to MrFlick's excellent explanation, here is a simple demonstration of how this function works:

getLinks <- function() { 
  links <- character() 
  list(a = function(node, ...) { 
    links <<- c(links, node)  ## the call to xmlGetAttr is omitted here
    node 
  }, 
  links = function() links)
}
h1 <- getLinks()

Now I call the handler several times and print the accumulated links before each call:

for (i in 1:3 ){
  print(h1$links())
  h1$a(paste0("node",i))
}

As you can see, links is just a character vector that grows at each call to the handler h1$a, which appends the newly found link:

character(0)
[1] "node1"
[1] "node1" "node2"
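
The accumulation depends on the superassignment operator <<-. As a rough sketch (make_handlers() and its use_super argument are hypothetical names), compare what happens when the handler uses an ordinary <- instead:

```r
# Hypothetical helper: build the handler list with either <- or <<-.
make_handlers <- function(use_super) {
  links <- character()
  a <- if (use_super) {
    function(node, ...) {
      links <<- c(links, node)   # modifies links in the enclosing environment
      node
    }
  } else {
    function(node, ...) {
      links <- c(links, node)    # creates a local copy, discarded on return
      node
    }
  }
  list(a = a, links = function() links)
}

with_super <- make_handlers(TRUE)
with_local <- make_handlers(FALSE)

for (i in 1:3) {
  with_super$a(paste0("node", i))
  with_local$a(paste0("node", i))
}

n_super <- length(with_super$links())
n_local <- length(with_local$links())
print(c(n_super, n_local))
```

With <-, every call starts from the empty vector again, so nothing survives between calls.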
agstudy answered Jan 08 '23 18:01