Algorithms to fix broken HTML

I am looking for algorithms and data structures one would use to fix broken HTML. I know that built-in tools for this exist in most languages, but I want to learn how it is done. Some approaches I can think of are:

  1. Using regular expressions, which seems like a naive approach
  2. Creating a DOM, but how would a DOM tree get created from broken HTML?

UPDATE: I am expecting more of a general discussion, but references to tools in C, C++, Python or Java are fine by me.

thanks

Srikar Appalaraju asked Apr 07 '26 18:04


2 Answers

Parse the markup using the HTML5 parsing algorithm (which is designed to handle broken markup) and build a DOM from it. You can then serialize the DOM back to HTML.
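To illustrate the idea of building a tree from broken markup and serializing it back, here is a minimal sketch in Python. It is not the HTML5 parsing algorithm: it only borrows the "stack of open elements" idea and leaves out the spec's special cases (implied tags, misnested formatting elements, raw text in script and style, and so on). The names `Node`, `TreeBuilder`, `serialize` and `repair` are made up for this example; only `html.parser.HTMLParser` and `html.escape` come from the standard library.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = list(attrs)
        self.children = []  # Node instances or plain text strings


class TreeBuilder(HTMLParser):
    """Builds a tree while tolerating missing and stray end tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.open = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.open[-1].children.append(node)
        if tag not in VOID:
            self.open.append(node)

    def handle_endtag(self, tag):
        # Look for the nearest open element with this name.
        for i in range(len(self.open) - 1, 0, -1):
            if self.open[i].tag == tag:
                # Popping it also closes everything nested deeper.
                del self.open[i:]
                return
        # No open element matches: a stray end tag, so ignore it.

    def handle_data(self, data):
        self.open[-1].children.append(data)


def serialize(node):
    """Write the tree back out with every element properly closed."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v)}"'
        for k, v in node.attrs
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def repair(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return serialize(builder.root)
```

For example, `repair("<p><b>bold<i>both</p>tail</span>")` closes the dangling `<i>` and `<b>` when `</p>` arrives and drops the stray `</span>`. A real HTML5 parser would handle this input differently in some details, so treat the sketch as a starting point for study rather than a replacement for one.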

Quentin answered Apr 10 '26 14:04


RegEx + HTML = disaster.

There are just too many ways for HTML to be valid SGML yet break the assumptions a regular expression makes.
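One concrete case, as a small Python sketch: a `>` inside a quoted attribute value is legal, but a typical "match a tag" pattern stops at it. The pattern and the `TagCollector` class are just illustrations chosen for this example; `HTMLParser` is the standard library's tolerant parser.

```python
import re
from html.parser import HTMLParser

snippet = '<a title="1 > 0" href="/x">link</a>'

# A common "match a tag" pattern. It stops at the first '>' it sees,
# even when that '>' sits inside a quoted attribute value.
naive_tags = re.findall(r"<[^>]+>", snippet)


class TagCollector(HTMLParser):
    """Records each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(snippet)
collector.close()

print(naive_tags)      # the start tag is cut off after '1 >'
print(collector.tags)  # the parser keeps both attributes intact
```

The pattern could be patched for this one case, but comments, unquoted values, and script content each need another patch, which is how the regex approach tends to spiral.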

What you really need is a stateful SGML parser. You don't mention what languages you're willing to work in, but there are many stateful SGML parsers out there.

In .NET we regularly use SGMLReader, a stateful parser that returns a well-formed DOM and/or XML DOM.

In C, the W3C has a reasonable SGML parser.

In Java there is a SAX-style SGML parser.
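To show what "stateful" means here, below is a toy tokenizer in Python that tracks whether it is in text, inside a tag, or inside a quoted attribute value. It is only a sketch: the `tokenize` function and its three state names are invented for this example, and it ignores comments, doctypes, and script content that a real SGML or HTML parser has to deal with.

```python
def tokenize(html):
    """Split markup into ('tag', text) and ('text', text) tokens."""
    tokens, buf = [], []
    state = "data"  # one of: "data", "tag", "quoted"
    quote = ""      # the quote character that opened the current value

    for ch in html:
        if state == "data":
            if ch == "<":
                if buf:
                    tokens.append(("text", "".join(buf)))
                buf = [ch]
                state = "tag"
            else:
                buf.append(ch)
        elif state == "tag":
            buf.append(ch)
            if ch in "\"'":
                # Entering a quoted value: '>' no longer ends the tag.
                quote, state = ch, "quoted"
            elif ch == ">":
                tokens.append(("tag", "".join(buf)))
                buf, state = [], "data"
        else:  # state == "quoted"
            buf.append(ch)
            if ch == quote:
                state = "tag"

    if buf:
        # Input ended mid-token; emit what we have rather than fail.
        kind = "text" if state == "data" else "tag"
        tokens.append((kind, "".join(buf)))
    return tokens
```

Because the meaning of each character depends on the current state, the same `>` ends a tag in one place and is plain attribute text in another. The last branch also shows the tolerant attitude a repair tool needs: truncated input still yields a token instead of an error.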

stephbu answered Apr 10 '26 14:04