Java library for cleaning up HTML just like a browser would

So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library which will in turn give me back perfectly well-formed HTML.

Doesn't sound so tough, right? After all, every browser on the market deals with malformed HTML on nearly every page load and turns it into something renderable. Each has its own slightly particular algorithm for cleaning up the contents (ahem...for HTML < 5, that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So then, why can't I find a good Java library for this very task?

One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on converting HTML to valid XML, and in the process they lose sight of how the poorly formatted document should be restructured. With nasty HTML they frequently fail to capture the author's intention and spit out documents that render quite differently from the original source. For this project, it's of the utmost importance that the two documents render similarly.

I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job...at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.

Can anyone help me in my search for the perfect HTML parser? Thanks in advance!

stevevls asked May 24 '11 15:05



2 Answers

I would say jsoup.
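A minimal sketch of what that could look like, assuming jsoup is on the classpath. The class name `CleanHtml` and the helper `clean` are made up for illustration. jsoup's documentation describes its parser as following the WHATWG HTML5 parsing rules, so the repaired tree should be close to what a modern browser builds, but whether the output renders similarly enough for your pages is something you would need to check yourself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CleanHtml {

    // Hypothetical helper: parse possibly malformed HTML with jsoup and
    // serialize the repaired DOM back to a well-formed HTML string.
    static String clean(String dirtyHtml) {
        // Jsoup.parse builds a full document (html/head/body) and closes
        // or re-nests unbalanced tags while building the tree.
        Document doc = Jsoup.parse(dirtyHtml);

        // Turn off pretty-printing so jsoup does not add indentation or
        // line breaks that were not in the source.
        doc.outputSettings().prettyPrint(false);

        return doc.outerHtml();
    }

    public static void main(String[] args) {
        // Two paragraphs with no closing tags, as often seen in the wild.
        String dirty = "<p>One<p>Two";
        System.out.println(clean(dirty));
    }
}
```

For pages fetched from the network, `Jsoup.connect(url).get()` returns the same kind of `Document`, so the serialization step stays the same.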

See Also

  • which-html-parser-is-best
jmj answered Oct 21 '22 12:10


I have used HTML Tidy in the past.

Chris Nava answered Oct 21 '22 14:10