
At what point does using a handler function improve HTML parsing efficiency?

Tags: html, r, xml

This question uses the R language. It is also tagged [xml] and [html] in case users of those tags have any input on the question.


With the XML package, I have always been under the impression that using a handler function to process an HTML document as it is being built at the C level will improve overall efficiency. However, I've been trying for a while now to find a situation in which that is actually true.

I think perhaps I'm not thinking about the situation in the right context (i.e. maybe a handler will be more useful on a larger, recursive document?). Anyway, here's my go at it.

Take the following two examples.


library(XML)
library(microbenchmark)
u <- "http://www.baseball-reference.com"

Example 1: Get the attributes of all nodes named "input" (search form names)

withHandler1 <- function() {
    h <- function() {
        input <- character()
        list(input = function(node, ...) {
            input <<- c(input, list(xmlAttrs(node, ...)))
            node
        },
            value = function() input)
    }
    h1 <- h()
    htmlParse(u, handler = h1)
    h1$value()
}

withoutHandler1 <- function() {
    xmlApply(htmlParse(u)["//input"], xmlAttrs)
}

identical(withHandler1(), withoutHandler1())
# [1] TRUE

microbenchmark(withHandler1(), withoutHandler1(), times = 25L)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq     max neval cld
#    withHandler1() 944.6507 1001.419 1051.602 1020.347 1097.073 1315.23    25   a
# withoutHandler1() 964.6079 1006.799 1040.905 1039.993 1069.029 1126.49    25   a

Okay, that was a very basic example, but the timings are virtually the same, and I suspect that if I ran it for the default 100 times they would converge.


Example 2: Get a subset of the attributes of all nodes named "input"

withHandler2  <- function() {    
    searchBoxHandler <- function(attr = character()) {
        input <- character()
        list(input = function(node, ...) {
            input <<- c(input, list(
                if(identical(attr, character())) xmlAttrs(node, ...)
                else vapply(attr[attr %in% names(xmlAttrs(node))],
                    xmlGetAttr, "", node = node)
            ))
            node
        },
            value = function() input)
    }
    h1 <- searchBoxHandler(attr = c("id", "type"))
    htmlParse(u, handler = h1)
    h1$value()
}    

withoutHandler2 <- function() {
    xmlApply(htmlParse(u)["//input"], function(x) {
        ## Note: match() used only to return identical objects
        xmlAttrs(x)[na.omit(match(c("id", "type"), names(xmlAttrs(x))))]
    })
}

identical(withHandler2(), withoutHandler2())
# [1] TRUE

microbenchmark(withHandler2(), withoutHandler2(), times = 25L)
# Unit: milliseconds
#              expr      min        lq     mean   median       uq      max neval cld
#    withHandler2() 966.0951 1010.3940 1129.360 1038.206 1119.642 2075.070    25   a
# withoutHandler2() 962.8655  999.4754 1166.231 1046.204 1118.661 2385.782    25   a

Again, very basic, but again the timings are almost identical.


So my question is: why use a handler function at all? For these examples, writing the handlers turned out to be wasted effort. Are there specific operations that are so costly that, when parsing HTML, I would see a significant improvement in speed and efficiency by using a handler function?

Asked by Rich Scriven


1 Answer

Referring to the XML article on Wikipedia, "Programming interfaces" section:

  Existing APIs for XML processing tend to fall into these categories:

  1. Stream-oriented APIs accessible from a programming language, for example SAX and StAX.
  2. Tree-traversal APIs accessible from a programming language, for example DOM.
  3. XML data binding, which provides an automated translation between an XML document and programming-language objects.
  4. Declarative transformation languages such as XSLT and XQuery.

Stream-oriented facilities require less memory and, for certain tasks which are based on a linear traversal of an XML document, are faster and simpler than other alternatives. Tree-traversal and data-binding APIs typically require the use of much more memory, but are often found more convenient for use by programmers; some include declarative retrieval of document components via the use of XPath expressions. XSLT is designed for declarative description of XML document transformations, and has been widely implemented both in server-side packages and Web browsers. XQuery overlaps XSLT in its functionality, but is designed more for searching of large XML databases.

It is now clear that performance is not the only factor to consider. For example:

SAX is fast and efficient to implement, but difficult to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of what part of the document is being processed. It is better suited to situations in which certain types of information are always handled the same way, no matter where they occur in the document.
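
In R, the XML package exposes this streaming style through xmlEventParse(). Here is a minimal, illustrative sketch, not the question's handler = approach (which still builds a tree): the file name "forms.xml" is hypothetical, and event parsing of this kind needs well-formed XML rather than messy real-world HTML.

library(XML)

## SAX-style sketch: react to each start tag as it streams past.
## No document tree is ever built, so memory use stays flat.
inputAttrs <- list()
xmlEventParse(
    "forms.xml",               # hypothetical well-formed XML file
    handlers = list(
        startElement = function(name, attrs, ...) {
            if (name == "input")
                inputAttrs[[length(inputAttrs) + 1L]] <<- attrs
        }
    )
)
inputAttrs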

On the other hand:

The Document Object Model (DOM) is an interface-oriented application programming interface that allows for navigation of the entire document as if it were a tree of node objects representing the document's contents. A DOM document can be created by a parser, or can be generated manually by users (with limitations). Data types in DOM nodes are abstract; implementations provide their own programming language-specific bindings. DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded into memory and constructed as a tree of objects before access is allowed.
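
In the question's R code, this tree-building step is exactly what htmlParse() does; the XPath query then walks the finished tree. A condensed sketch using the same URL from the question:

library(XML)

## DOM style: parse the whole page into a C-level tree, then query it with XPath.
u   <- "http://www.baseball-reference.com"
doc <- htmlParse(u)
res <- xpathApply(doc, "//input", xmlAttrs)  # navigate the finished tree at will
free(doc)                                    # release the C-level document when done
res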

To sum it up:

Your examples are not a real-world case where the data can be much, much bigger; only at that scale will the circumstances decide which interface is the best one to use.
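
If your documents ever do get that big, the XML package also offers a middle ground. The sketch below is an assumption-laden illustration (the file "big.xml" is hypothetical and must be well-formed XML): the branches argument of xmlEventParse() streams the document but hands complete <input> subtrees to a handler, so only those small branches are ever materialised as nodes.

library(XML)

## Hybrid sketch: stream a large file, but build nodes only for <input> subtrees.
ids <- character()
xmlEventParse(
    "big.xml",                 # hypothetical large, well-formed XML file
    handlers = list(),         # no generic event handlers needed here
    branches = list(
        input = function(node) {
            ids <<- c(ids, xmlGetAttr(node, "id", default = NA_character_))
        }
    )
)
ids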

Answered by ProllyGeek