I am looking for the algorithms & data structures one would use to fix broken HTML. I know lots of built-in tools exist in every language to do this, but I want to learn how it works. Some approaches I can think of are -
UPDATE: I am expecting more of a general discussion, but references to tools in C, C++, Python or Java are fine by me.
thanks
Parse the markup using the HTML5 parsing algorithm (which is designed to handle broken markup) and build a DOM from it. You can then serialize the DOM back to HTML.
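To make the parse-then-serialize idea concrete, here is a minimal sketch in Python. It is not the HTML5 parsing algorithm: it uses the lenient tokenizer in the standard library's html.parser and a plain stack of open elements in place of a DOM. The names Repairer, repair and VOID are made up for this illustration. It drops comments and doctypes, does not treat script/style content specially, and does not reopen formatting elements the way the real algorithm does.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (the HTML "void" elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Repairer(HTMLParser):
    """Hypothetical helper: re-emit markup while tracking open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # currently open elements, innermost last
        self.out = []    # pieces of the repaired document

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag with no matching open tag: drop it
        # Close everything opened after the matching element, then the element.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again on the way out.
        self.out.append(escape(data, quote=False))


def repair(markup):
    parser = Repairer()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    # Close whatever is still open at end of input.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(repair("<p><b>bold<i>both</b>text"))
```

The stack is the key data structure: an end tag closes every element opened after its match, a stray end tag is ignored, and anything still open at the end of input is closed in reverse order. A real HTML5 parser layers many more rules on top of the same stack (implied end tags, the "adoption agency" handling of misnested formatting, table fix-ups).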
RegEx + HTML = disaster.
There are just too many ways for HTML to be valid SGML yet defeat a regular expression.
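One small case, as an illustration rather than a proof: a ">" inside a quoted attribute value is legal, but a naive tag-stripping regex treats it as the end of the tag. The snippet and the TextCollector class below are invented for this example; the parser is Python's standard-library html.parser, used here only because it tracks whether it is inside a quoted value.

```python
import re
from html.parser import HTMLParser

snippet = '<a title="a > b">link</a>'

# Naive tag stripper: "anything between < and the next >".
naive_text = re.sub(r"<[^>]+>", "", snippet)


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers only the text nodes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed(snippet)
collector.close()
parsed_text = "".join(collector.chunks)

print(repr(naive_text))   # ' b">link'  (the regex cut the tag at the wrong >)
print(repr(parsed_text))  # 'link'
```

The regex can be patched for this one case, but each patch meets another construct (comments, script content, unquoted attributes) that needs state to handle.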
Really you need stateful SGML parsers. You don't mention what languages you're willing to work in, but there are many stateful SGML parsers out there.
In .NET we regularly use SGMLReader - a stateful parser that returns a well-formed DOM and/or XML DOM.
In C, the W3C has a reasonable SGML parser.
In Java there is a SAX-style SGML parser.