
Library to query HTML with XPath in Java?

Can anyone recommend a Java library that lets me run XPath queries against HTML pages fetched from URLs? I've tried JAXP without success.

Thank you.

Leonardo Marques asked Jul 29 '10 09:07



2 Answers

There are several different approaches to this documented on the Web:

Using HtmlCleaner

  • HtmlCleaner plus the Java DOM parser: see "Using XPath Contains against HTML in Java" (this is the approach I recommend)
  • HtmlCleaner also has a built-in utility supporting XPath: see the Javadoc at http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/XPather.html or this example: http://thinkandroid.wordpress.com/2010/01/05/using-xpath-and-html-cleaner-to-parse-html-xml/

Using Jericho

  • Jericho plus Jaxen: http://sujitpal.blogspot.com/2009/04/xpath-over-html-using-jericho-and-jaxen.html

I have tried a few variations of these approaches, for example HtmlParser plus the Java DOM parser, and jsoup plus Jaxen. The combination that worked best was HtmlCleaner plus the Java DOM parser; the next best was Jericho plus Jaxen.
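To make the "cleaner plus Java DOM" idea concrete, here is a minimal sketch of the second half only: running an XPath `contains` query with the JDK's `javax.xml.xpath` once a W3C `Document` exists. It uses nothing outside the JDK, so the cleaning step is replaced by a stand-in: `parseWellFormed` is a hypothetical helper that only accepts markup that is already well-formed, which real-world HTML often is not. In the recommended setup the `Document` would instead come from HtmlCleaner (as far as I know via its `DomSerializer`). The class name, method names, and sample markup are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOverDom {

    // Stand-in for the HTML cleaning step (assumption): this only works on
    // markup that is already well-formed. Untidy HTML would throw here.
    static Document parseWellFormed(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Evaluates an XPath expression against the DOM and returns the text
    // content of every matching node, in document order.
    static List<String> selectText(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page, already well-formed.
        String markup = "<html><body>"
            + "<a href=\"/docs/intro\">Intro</a>"
            + "<a href=\"/blog/post\">Post</a>"
            + "</body></html>";
        Document doc = parseWellFormed(markup);
        // Select only the links whose href contains "docs".
        System.out.println(selectText(doc, "//a[contains(@href, 'docs')]"));
    }
}
```

The point of the split is that the XPath code never needs to know which library produced the DOM, so the cleaner can be swapped without touching the queries.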

Mark Butler answered Oct 22 '22 02:10


jsoup, a Java HTML parser. Its selector syntax is very similar to jQuery's.

Artem Barger answered Oct 22 '22 02:10