
Best way to parse invalid HTML in PHP

Tags: html, php, parsing

Is there a better approach to parsing invalid HTML than running Tidy on it?

Side note: there are some situations where Tidy isn't available. I also understand that regular expressions are not recommended for parsing HTML.

johnlemon asked Aug 31 '10 07:08




2 Answers

I would try something like this: http://php.net/manual/en/domdocument.loadhtml.php

From that page:

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object.
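A minimal sketch of that approach, assuming the DOM extension is enabled. The sample markup and variable names are made up for illustration. Malformed input makes libxml report parse problems, so the sketch collects them with libxml_use_internal_errors() rather than letting them surface as PHP warnings:

```php
<?php
// Deliberately sloppy markup: unclosed <li> and <p>, no <html> or <body>.
$html = '<ul><li>One<li>Two <a href="/x">link</a></ul><p>Unclosed paragraph';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// The collected errors can be inspected or logged; here they are just cleared.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the repaired tree as usual, for example with XPath.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $link) {
    $links[$link->getAttribute('href')] = trim($link->textContent);
}

print_r($links);
```

The sketch calls loadHTML() on an instance; the static call mentioned in the quoted manual text is, as far as I know, no longer accepted by newer PHP versions.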

Rob answered Sep 18 '22 02:09


SimpleHTMLDOM is known to be more lenient than PHP's native DOM functions.

Pekka answered Sep 17 '22 02:09