
Is there an alternative to HTML Tidy?

I have embedded HTML Tidy in my application to clean incoming HTML. But Tidy has a huge number of bugs, and fixing them directly in the source is my worst nightmare. The Tidy source code is an unreadable abomination: thousand-plus-line functions, poor variable naming, spaghetti code, and so on. It's truly horrible.

Worse yet, official development seems to have ceased. In the last 12 months, there have been three write transactions to the official CVS repo. But it's been dead and buried for much longer than that...

So I'm looking for an OSS C or C++ application/library that can do what Tidy can (when it feels like it): fix bad HTML markup and transform it into valid XHTML (this is the part I'm interested in). And I mean all sorts of bad markup.

Is there something like that out there?

EDIT: I need it both for manipulations on the DOM tree by an XML handling tool and for general compliance with the XHTML spec. My app needs to accept HTML from users (which is often invalid in all sorts of ways) and output valid XHTML. It needs to be able to handle even HTML that would normally not display in a browser because the user edited it by hand and didn't check afterwards.

A drop-in replacement for Tidy's error-correcting parser... that doesn't suck. I don't mind bugs if the source is readable and I can fix problems myself, or if there are active developers who provide bugfixes on a timely basis.

Lucas asked Feb 21 '10 18:02



3 Answers

Could you tell us what you plan to use this tool for? As in, do you want to fix static web pages, or do you want some sort of filtering step before other manipulations, so that some tool can handle buggy web pages?

Personally, I write my own tool atop Python's BeautifulSoup or lxml whenever I need to; it's a script of a dozen lines at most and does much of what I want.
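To give a rough idea of what such a script does, here is a minimal sketch of the "parse tolerantly, then re-serialize" approach. It is not the BeautifulSoup or lxml code this answer refers to: it swaps in the standard library's `html.parser` so that it runs without third-party packages. The names `XhtmlFixer`, `fix_html`, and `VOID_TAGS` are made up for illustration, and comments, doctypes, and script contents are ignored for brevity.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML; written as <tag /> here.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class XhtmlFixer(HTMLParser):
    """Re-serialize leniently parsed HTML as well-formed markup (sketch only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of the elements that are currently open

    def _attrs(self, attrs):
        # Quote every attribute; a valueless attribute becomes name="name".
        return "".join(
            ' %s="%s"' % (name, escape(name if value is None else value))
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.out.append("<%s%s />" % (tag, self._attrs(attrs)))
        else:
            self.out.append("<%s%s>" % (tag, self._attrs(attrs)))
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Already self-closed in the input, e.g. <br/>.
        self.out.append("<%s%s />" % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close every element left open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape the text.
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()  # flush any buffered text
        # Close whatever is still open at the end of the input.
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


def fix_html(markup):
    fixer = XhtmlFixer()
    fixer.feed(markup)
    return fixer.result()


if __name__ == "__main__":
    print(fix_html("<P>Hello <B>world<BR><img src=a.png alt=x></p></i>"))
```

This lowercases tag names, quotes attributes, self-closes void elements, closes unclosed tags, and drops stray closing tags, so the output is well-formed. It knows nothing about HTML's implied end tags, though: `<li>one<li>two` comes out as nested `li` elements, and nothing is validated against the XHTML DTD. A real parser library handles those cases, which is why I build on BeautifulSoup or lxml rather than on something like this.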

pavpanchekha answered Nov 15 '22 23:11



There is a new, nice, proper Tidy that supports HTML5, so the alternative to the old, ugly Tidy would be the new Tidy (GitHub repository).

Benjamin W. answered Nov 15 '22 22:11



Try Pretty Diff. It is a vastly superior beautification algorithm and it does not make any assumptions about your input.

http://prettydiff.com/?m=beautify&html

austincheney answered Nov 16 '22 00:11
