
Are there any Java HTML parsers where the generated Nodes retain indexes to the original text?

I'd like to query an HTML document as XML (e.g. with XPath), so I need to pass the HTML through some form of HTML cleaner.

But I'd also like to make modifications to the original source string based on the results of the queries.

Is there a Java HTML parser that retains indexes into the original source string, so I can locate a node and modify the correct part of the original string?

Cheers.

Paul Grime asked Sep 03 '11 23:09



1 Answer

It sounds like Jericho is almost exactly what you want. It is a robust HTML parser designed specifically for making unintrusive modifications to the source document.

While it doesn't come with DOM, SAX, or StAX interfaces, it has custom APIs that are similar enough to those standards that you should be able to adapt your approach to them fairly easily, or write an adapter between whatever you are using and Jericho. For instance, you can do XPath queries on Jericho documents using Jaxen -- see this blog entry for an example.

Jericho records begin and end positions for every element, and even for parts of an element such as the tag name or an attribute name, so you can edit the document yourself using those offsets. Where Jericho really shines, though, is the OutputDocument class: it lets you specify replacements directly by passing the Jericho elements that match your query to the appropriate methods, instead of explicitly calling getBegin() and getEnd() on them and passing the results to some replacement method.
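To illustrate the "edit the document yourself" route, here is a minimal sketch in plain Java of applying offset-based replacements to the original string. OffsetEditor and its methods are hypothetical names made up for this example and are not part of Jericho. In real use the begin/end values would come from getBegin() and getEnd() on the matched elements; here they are found with indexOf just to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not a Jericho class): collects replacements expressed
// as offsets into the ORIGINAL source and applies them all in one pass.
class OffsetEditor {
    // One replacement of the half-open range [begin, end) of the original text.
    private static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    private final String source;
    private final List<Edit> edits = new ArrayList<>();

    OffsetEditor(String source) {
        this.source = source;
    }

    // Register a replacement; offsets always refer to the unmodified source.
    void replace(int begin, int end, String replacement) {
        if (begin < 0 || end < begin || end > source.length()) {
            throw new IllegalArgumentException("bad range: " + begin + ".." + end);
        }
        edits.add(new Edit(begin, end, replacement));
    }

    // Build the output by copying untouched text between the sorted edits,
    // so everything outside the edited ranges is preserved byte for byte.
    String apply() {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.begin));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Edit e : sorted) {
            if (e.begin < cursor) {
                throw new IllegalStateException("overlapping edits");
            }
            out.append(source, cursor, e.begin);
            out.append(e.replacement);
            cursor = e.end;
        }
        out.append(source, cursor, source.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        OffsetEditor editor = new OffsetEditor(html);
        // Stand-in for element.getBegin() / element.getEnd().
        int begin = html.indexOf("<b>");
        int end = html.indexOf("</b>") + "</b>".length();
        editor.replace(begin, end, "<em>world</em>");
        System.out.println(editor.apply()); // prints <p>Hello <em>world</em></p>
    }
}
```

Because every offset refers to the unmodified source, the edits can be registered in any order. This is roughly the bookkeeping that OutputDocument is described above as taking off your hands.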

Stop Harming the Community answered Sep 28 '22 03:09