Is there a better approach to parsing invalid HTML than running Tidy on it?
Side note: There are some situations where Tidy isn't available. I also understand that regular expressions are not recommended for parsing HTML.
HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.
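To make the point concrete, here is a minimal sketch. It uses Python only for illustration, since the thread itself is about PHP. It assumes a deliberately broken input string (`BROKEN`) and a hypothetical helper class (`TagAndTextCollector`), neither of which comes from the original posts. It compares a naive regex with the standard library's `html.parser`, which does not require well-formed input.

```python
import re
from html.parser import HTMLParser

# Deliberately invalid HTML (an assumption for this sketch):
# nested <div>s, an unclosed <b>, and unclosed <p> elements.
BROKEN = "<div><div><p>first<b>bold</div><p>second</div>"

# Naive regex: lazily match from a <div> to the next </div>.
# It pairs the OUTER opening tag with the INNER closing tag,
# because a regex has no notion of nesting depth.
naive = re.search(r"<div>(.*?)</div>", BROKEN).group(1)


class TagAndTextCollector(HTMLParser):
    """Hypothetical helper: records start tags and text as they appear."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def collect(html):
    """Feed possibly invalid HTML to the tolerant parser and return (tags, text)."""
    parser = TagAndTextCollector()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text


if __name__ == "__main__":
    print("regex captured:", naive)
    print("parser saw:", collect(BROKEN))
```

The regex returns a fragment that starts inside the outer `div` but ends at the inner closing tag, while the event-based parser still reports every tag and text node despite the missing closing tags. It does not repair the tree, though; that is the job of a DOM-building parser such as the one suggested in the answers.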
HTML is a markup language with a fairly simple structure, so it would be quite easy to build a parser for it with a parser generator. You may not even need to do that if you choose a popular parser generator such as ANTLR, because ready-made HTML grammars are already available for it.
A parse error in CSS arises when the CSS parser encounters something that does not conform to the syntax it expects. A common cause is a missing semicolon between declarations: each declaration must be terminated by a semicolon before the next one begins (only the last declaration in a block may omit it).
I would try something like this: http://php.net/manual/en/domdocument.loadhtml.php
From that page:
The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object.
SimpleHTMLDOM is known to be more lenient than PHP's native DOM functions.