Why use the DOM to parse webpages instead of regex?

Tags:

dom

regex

php

search

parsing

I've been searching for questions about finding content in a page, and a lot of answers recommend using the DOM instead of regex when parsing webpages. Why is that? Does it improve processing time, or is there some other reason?

355

asked Apr 04 '12 09:04

Jürgen Paul

2 Answers

A DOM parser is actually parsing the page.

A regular expression is searching for text, not understanding the HTML's semantic meaning.

It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.

You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
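To make the point above concrete, here is a minimal sketch in Python (standard library only). The sample markup, the `NAIVE` pattern, and the helper names `regex_links` and `parser_links` are made up for illustration. The pattern is written against the first link's exact shape, which is the kind of limited-use-case regex described above.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: four ways an anchor can show up in real markup.
HTML = """
<a href="/one">One</a>
<a class="nav" href='/two'>Two</a>
<A HREF=/three>Three</A>
<!-- <a href="/commented-out">Old</a> -->
"""

# A naive pattern that only knows the exact shape of the first link.
NAIVE = re.compile(r'<a href="([^"]*)">')


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def regex_links(html):
    """Return whatever the naive pattern happens to match."""
    return NAIVE.findall(html)


def parser_links(html):
    """Return the hrefs of the anchors the parser actually sees."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(regex_links(HTML))
print(parser_links(HTML))
```

With this input the regex returns `/one` and `/commented-out`: it misses the links with an extra attribute, single quotes, or uppercase names, and it picks up a link that sits inside a comment. The parser returns `/one`, `/two`, and `/three`. The pattern could be patched for each case, but every patch assumes you already know which variations will turn up.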

Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
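As a rough illustration of the readability point, the sketch below uses the limited XPath subset in Python's `xml.etree.ElementTree`. That module only accepts well-formed XML, so the sample page is deliberately well-formed; for real-world HTML you would want an HTML-aware parser with XPath support (in PHP, for example, `DOMDocument` with `DOMXPath`). The page content and the `links_in_table` helper are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <a href="/home">Home</a>
  <table id="prices">
    <tr><td><a href="/item/1">Item 1</a></td></tr>
    <tr><td><a href="/item/2">Item 2</a></td></tr>
  </table>
</body></html>
"""


def links_in_table(page, table_id):
    """Return the href of every <a> nested anywhere inside the given table."""
    root = ET.fromstring(page.strip())
    # The path describes structure ("an <a> somewhere under this table"),
    # not the characters that happen to surround the link in the source.
    path = f".//table[@id='{table_id}']//a"
    return [a.get("href") for a in root.findall(path)]


print(links_in_table(PAGE, "prices"))
```

The path says "anchors inside the table with id `prices`" and nothing about quoting, whitespace, or closing tags. If the page later wraps each link in a `<span>`, the same expression still finds them, whereas a regex anchored on `<td><a` would need rewriting.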

So, instead of using the wrong tool for the job (a text-matching tool for a structured document), use the right tool for the job (an HTML parser for parsing HTML).

answered Nov 14 '22 22:11

Borealid

I can't stand hearing that "HTML is not a regular language ..." anymore. Regular expressions (as used in today's languages) aren't regular either.

The simple answer is:

A regular expression is not a parser. It describes a pattern and will match that pattern, but it has no idea about the document structure. You can't parse anything with one regex. Of course, regexes can be part of a parser. I don't know for sure, but I assume nearly every parser uses regexes internally to find certain sub-patterns.

If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you will not be able to create this pattern, because it's practically impossible to cover all the corner cases, or dependencies like "find all links, but only if they are green and not pink".

In most cases it's a lot easier to use a parser that understands the structure of your document and also accepts a lot of "broken" HTML. It makes it easy to access all links, or all table elements of a certain table, or ...
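Here is a small sketch of the "green but not pink" case on sloppy markup, using Python's standard `html.parser`. The sample HTML, the idea that colour is carried by a `class` attribute, and the `green_links` helper are all assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical "broken" markup: unclosed <li> elements, an unclosed <a>,
# and attributes in varying order.
BROKEN = """
<ul>
  <li><a class="green" href="/a">A
  <li><a class="pink" href="/b">B</a>
  <li><a href="/c" class="big green">C</a>
</ul>
"""


class GreenLinkCollector(HTMLParser):
    """Collects hrefs of <a> tags whose class list contains 'green'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # A valueless attribute is reported as None, hence the `or ""`.
        classes = (attributes.get("class") or "").split()
        if "green" in classes and "href" in attributes:
            self.links.append(attributes["href"])


def green_links(html):
    """Return the hrefs of all links marked green, in document order."""
    collector = GreenLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(green_links(BROKEN))
```

For this input the result is `/a` and `/c`. The condition is an ordinary check on parsed attributes, so the attribute order and the missing closing tags do not matter. A single regex would have to encode both attribute orders and the class-list matching by hand.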

answered Nov 14 '22 21:11

stema
