So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library which will in turn give me back perfectly well-formed HTML.
Doesn't sound so tough, right? After all, every browser on the market effectively deals with the challenge of malformed HTML and turns it into something renderable with nearly every page load. Each has its own slightly particular algorithm for cleaning up the contents (ahem...for HTML < 5 that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So then, why can't I find a good Java library for this very task?
One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on solving the problem of converting HTML to valid XML, and in the process, they lose sight of how the poorly-formatted document should be re-structured. With nasty HTML they frequently don't capture the author's intention and spit out documents that render quite differently from the original source. And for this project, it's of the utmost importance that the two documents render similarly.
I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job...at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.
Can anyone help me in my search for the perfect HTML parser? Thanks in advance!
Jsoup is an open source Java library used mainly for extracting data from HTML. It also allows you to manipulate and output HTML. It has a steady development line, great documentation, and a fluent and flexible API. Jsoup can also be used to parse and build XML.
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
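As a rough illustration of the selector and manipulation features mentioned above, here is a minimal sketch. It assumes jsoup is on the classpath; the class name `JsoupSelectorSketch`, the method names, and the sample markup are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup itself.
public class JsoupSelectorSketch {

    /** Collects the href value of every anchor that has an href attribute. */
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> targets = new ArrayList<>();
        // CSS-style selector: <a> elements carrying an href attribute.
        for (Element link : doc.select("a[href]")) {
            targets.add(link.attr("href"));
        }
        return targets;
    }

    /** Adds rel="nofollow" to links whose href starts with "http". */
    static String markExternal(String html) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href^=http]")) {
            link.attr("rel", "nofollow"); // attribute manipulation
        }
        // Serialize only the contents of <body>.
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href='/a'>A</a><a>no target</a><a href=\"http://example.com\">B</a>";
        System.out.println(linkTargets(html));
        System.out.println(markExternal(html));
    }
}
```

Each method parses its input separately, so neither call affects the other.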
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
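For the clean-up task in the question, something along these lines might be a starting point. It is only a sketch, assuming jsoup is on the classpath; `JsoupCleanupSketch`, `normalize`, and the sample input are hypothetical names chosen for illustration. Whether the output renders the same as the original in a browser would still need to be checked against your own pages.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup itself.
public class JsoupCleanupSketch {

    /** Parses possibly malformed HTML and returns jsoup's serialized, balanced document. */
    static String normalize(String dirtyHtml) {
        // The parser builds a full document tree (html, head, body) even when
        // the input omits those elements or leaves tags unclosed.
        Document doc = Jsoup.parse(dirtyHtml);
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        // Unclosed <p> elements and mis-nested inline tags.
        String dirty = "<p>First paragraph<p>Second <b>bold <i>mixed</b> nesting</i>";
        System.out.println(normalize(dirty));
    }
}
```

The serialized result is HTML rather than XML, which matches the requirement of not converting the page to XML along the way.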
JSoup I would say
I have used HTML Tidy in the past.