There's some work in progress on adding XPath support to jsoup: https://github.com/jhy/jsoup/pull/80.
With XPath expressions, you can select elements within the HTML while jsoup serves as the HTML parser.
What It Is. jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation.
Description. The parse(String html) method parses the input HTML into a new Document. This Document object can then be used to traverse the HTML DOM and read its details.
jsoup doesn't support XPath yet, but you may try Xsoup - "Jsoup with XPath".
Here's an example quoted from the project's GitHub site:
    @Test
    public void testSelect() {
        String html = "<html><div><a href='https://github.com'>github.com</a></div>"
                + "<table><tr><td>a</td><td>b</td></tr></table></html>";

        Document document = Jsoup.parse(html);

        String result = Xsoup.compile("//a/@href").evaluate(document).get();
        Assert.assertEquals("https://github.com", result);

        List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
        Assert.assertEquals("a", list.get(0));
        Assert.assertEquals("b", list.get(1));
    }
There you'll also find a list of features and expressions of XPath that are supported by XSoup.
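If you just want to see what those two expressions select without pulling in an extra library, here is a rough sketch using the JDK's built-in javax.xml.xpath instead of Xsoup. Note the assumption: this only works because the sample markup happens to be well-formed XML; real-world HTML usually is not, which is exactly why you would parse it with jsoup first. The class and method names (JdkXPathSketch, first, all) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JdkXPathSketch {
    // Same markup as the Xsoup test above; it is well-formed XML, so the JDK parser accepts it.
    static final String MARKUP =
            "<html><div><a href='https://github.com'>github.com</a></div>"
            + "<table><tr><td>a</td><td>b</td></tr></table></html>";

    // Returns the string value of the first node matched by the expression.
    static String first(String expression, String markup) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or malformed markup: " + expression, e);
        }
    }

    // Returns the text content of every node matched by the expression, in document order.
    static List<String> all(String expression, String markup) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(markup)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or malformed markup: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(first("//a/@href", MARKUP));    // https://github.com
        System.out.println(all("//tr/td/text()", MARKUP)); // [a, b]
    }
}
```

For anything scraped from a live site, the strict XML parser will most likely choke, so treat this only as a way to experiment with the expressions themselves.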
Not yet, but the JsoupXpath project has done it. For example:
    String html = "<html><body><script>console.log('aaaaa')</script>"
            + "<div class='test'>some body</div><div class='xiao'>Two</div></body></html>";
    JXDocument underTest = JXDocument.create(html);
    String xpath = "//div[contains(@class,'xiao')]/text()";
    JXNode node = underTest.selNOne(xpath);
    Assert.assertEquals("Two", node.asString());
By the way, it supports the complete W3C XPath 1.0 standard syntax, such as:
    //ul[@class='subject-list']/li[./div/div/span[@class='pl']/num()>(1000+90*(2*50))][last()][1]/div/h2/allText()
    //ul[@class='subject-list']/li[not(contains(self::li/div/div/span[@class='pl']//text(),'14582'))]/div/h2//text()
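To get a feel for the standard XPath 1.0 pieces used in those expressions (contains(), not(), last()), here is a small sketch that runs them through the JDK's own XPath engine on the JsoupXpath sample markup. Two caveats: num() and allText() are not part of the XPath 1.0 core function library, so they are left out here, and this relies on the sample being well-formed XML, which is an assumption that will not hold for most real pages. The class name Xpath10FunctionsSketch and the eval helper are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class Xpath10FunctionsSketch {
    // Sample markup from the JsoupXpath example; well-formed, so the JDK parser can read it.
    static final String MARKUP =
            "<html><body><script>console.log('aaaaa')</script>"
            + "<div class='test'>some body</div><div class='xiao'>Two</div></body></html>";

    // Evaluates the expression against MARKUP and returns the string value of the first match.
    static String eval(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(MARKUP)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // contains(): div whose class attribute contains 'xiao'
        System.out.println(eval("//div[contains(@class,'xiao')]/text()"));      // Two
        // not(): the div whose class does NOT contain 'xiao'
        System.out.println(eval("//div[not(contains(@class,'xiao'))]/text()")); // some body
        // last(): the last div among its siblings
        System.out.println(eval("//div[last()]/text()"));                       // Two
    }
}
```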