
Can XPath and XQuery work on HTML documents?

I heard that an HTML document is not an XML document, from https://stackoverflow.com/a/39560454.

XPath and XQuery work on XML documents. Can they work on HTML documents, and why?

Although I don't know why, I guess XPath can work on HTML documents, judging from https://www.quora.com/Why-do-we-use-XPath-in-Selenium-even-though-CSS-Selector-is-faster and https://html-agility-pack.net/

Tim asked Apr 23 '19 22:04



1 Answer

XQuery and XPath are defined to work on a particular data model called XDM (the XQuery and XPath Data Model). In XPath 1.0 the data model is described within the XPath specification itself; in XQuery and later XPath versions it is defined in a separate specification. XPath and XQuery can work on any data for which a mapping to XDM is defined. The XML DOM and the HTML DOM both differ from XDM in a number of details, but it is possible (with a bit of pragmatism) to define a mapping from each to XDM, and therefore XPath can be made to run against both XML and HTML DOMs. Indeed, both of these mappings are very widely used, even though they are imperfect and in some cases inefficient.
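As a rough illustration of the "mapping" idea, here is a minimal Python sketch using only the standard library. It is an analogy under several assumptions: the tree built here is an ElementTree, not XDM; ElementTree supports only a small subset of XPath 1.0; and the names `TreeMapper`, `html_to_tree` and `VOID` are made up for this example. The input is HTML that is not well-formed XML (an unclosed `<br>` and an unquoted attribute value), yet once it is mapped to a tree, a path such as `.//table//p` can be evaluated against it.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical, incomplete list of void elements (those that never take an end tag).
VOID = {"br", "img", "hr", "input", "meta", "link"}


class TreeMapper(HTMLParser):
    """Maps HTML parse events onto an ElementTree (a stand-in for XDM here)."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as "".
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            # Void elements are closed immediately so the tree stays well nested.
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def html_to_tree(html):
    """Assumes the input is properly nested apart from void elements."""
    mapper = TreeMapper()
    mapper.feed(html)
    mapper.close()
    return mapper.builder.close()


HTML = (
    "<html><body>"
    "<table border=1><tr><td><p>first<br>line</p></td></tr></table>"
    "<p>outside</p>"
    "</body></html>"
)

root = html_to_tree(HTML)
# Select only the paragraphs that sit inside a table.
cells = [p.text for p in root.findall(".//table//p")]
print(cells)
```

A real HTML parser does far more than this (implied end tags, error recovery, and so on), which is part of why such mappings are described as pragmatic rather than exact.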

The biggest problem with the HTML mapping to XDM is namespaces. XPath implementations traditionally regard HTML elements such as "table" and "p" as being in no namespace, so paths such as //table//p can be used without namespace prefixes. But in HTML5, the WHATWG decided that these elements are in the XHTML namespace, which meant that they had to define a variation to the XPath spec to accommodate such paths.
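The effect of the namespace can be seen with a small Python sketch, again using the standard library's ElementTree and its limited XPath subset. The prefix `h` and the sample document are arbitrary choices for illustration. When the elements are in the XHTML namespace, the unprefixed path matches nothing, and a prefix bound to that namespace is needed.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A well-formed XHTML fragment whose elements are all in the XHTML namespace.
DOC = (
    f'<html xmlns="{XHTML_NS}"><body>'
    "<table><tr><td><p>cell</p></td></tr></table>"
    "</body></html>"
)

root = ET.fromstring(DOC)

# Unprefixed names mean "no namespace", so nothing matches.
unprefixed = root.findall(".//table//p")

# Binding a prefix (here "h", an arbitrary choice) to the XHTML namespace matches.
prefixed = root.findall(".//h:table//h:p", {"h": XHTML_NS})

print(len(unprefixed), len(prefixed))  # 0 1
```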

CSS selectors have slowly acquired much of the expressive power of XPath 1.0, though they are certainly not as rich as later versions, and since they are designed primarily for HTML rather than XML, they can sometimes be more convenient to use. I haven't seen any performance data, but the browser vendors have by necessity put a lot of effort into making CSS fast, and they seem to have done almost zero development on their XPath implementations in the last 15 years, so it certainly wouldn't surprise me if CSS is faster in most browsers. The differences between DOM and XDM also create overheads: notably the very inefficient representation of namespaces in DOM.
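To make the overlap in expressive power concrete, the following Python sketch translates a very small subset of CSS (type selectors, `#id`, and the descendant and child combinators) into the path syntax accepted by the standard library's ElementTree. The function `css_to_path` is hypothetical and written only for this illustration; it is not a real CSS engine, and it says nothing about relative performance.

```python
import re
import xml.etree.ElementTree as ET


def css_to_path(selector):
    """Translate a tiny CSS subset (type, #id, ' ' and '>') to an ElementTree path."""
    tokens = re.findall(r">|[^\s>]+", selector)
    path, axis = ".", "//"          # whitespace combinator = descendant axis
    for tok in tokens:
        if tok == ">":
            axis = "/"              # child combinator = child axis
            continue
        name, _, ident = tok.partition("#")
        step = name or "*"
        if ident:
            step += f"[@id='{ident}']"
        path += axis + step
        axis = "//"
    return path


DOC = (
    '<html><body><div id="main"><p>a</p>'
    "<table><tr><td><p>b</p></td></tr></table>"
    "</div></body></html>"
)
root = ET.fromstring(DOC)

descendant_path = css_to_path("table p")
child_path = css_to_path("div#main > p")
print(descendant_path)  # .//table//p
print(child_path)       # .//div[@id='main']/p

in_table = [p.text for p in root.findall(descendant_path)]
direct = [p.text for p in root.findall(child_path)]
print(in_table, direct)
```

Going the other way is not always possible: XPath can, for example, walk up to a parent or ancestor, which this CSS subset cannot express.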

Michael Kay answered Sep 27 '22 18:09
