
Why use a DOM parser to parse web pages instead of regex?

I've been searching for questions about finding content in a page, and a lot of answers recommend using a DOM parser instead of regex when parsing web pages. Why is that? Does it improve processing time, or is there another reason?

Jürgen Paul asked Apr 04 '12 09:04




2 Answers

A DOM parser is actually parsing the page.

A regular expression is searching for text, not understanding the HTML's semantic meaning.

It is provable that HTML is not a regular language; therefore, it is impossible to create a regular expression that will parse all instances of an arbitrary element-pattern from an HTML document without also matching some text which is not an instance of that element-pattern.
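To make the nesting problem concrete, here is a minimal sketch in Python (my choice of language; the question doesn't name one). The sample markup, `DIV_PATTERN`, and the `OuterDivText` helper are all made up for illustration. Python's `re` module has no recursion support, so a lazy pattern stops at the first closing tag it sees, while an event-based parser can simply count depth.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern: lazily match everything between <div> and </div>.
DIV_PATTERN = re.compile(r"<div>(.*?)</div>", re.DOTALL)


class OuterDivText(HTMLParser):
    """Collects the text of each top-level <div>, nested divs included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.current = []   # text pieces of the div being collected
        self.blocks = []    # finished top-level divs

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def outer_div_text(html):
    parser = OuterDivText()
    parser.feed(html)
    parser.close()
    return parser.blocks


SAMPLE = "<div>outer <div>inner</div> tail</div>"

regex_result = DIV_PATTERN.findall(SAMPLE)   # ['outer <div>inner']
parser_result = outer_div_text(SAMPLE)       # ['outer inner tail']
```

The regex pairs the outer opening tag with the inner closing tag and loses " tail"; the parser-based version keeps the element boundaries straight because it tracks how many divs are open.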

You may be able to design a regular expression which works for your particular use case, but foreseeing exactly the HTML with which you'll be provided (and, consequently, how it will break your limited-use-case regex) is extremely difficult.
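As a rough illustration of how a limited-use-case regex breaks, the sketch below compares a naive link pattern with Python's standard-library `html.parser`. The pattern `NAIVE_LINK`, the `LinkCollector` class, and the sample markup are assumptions invented for this example, not anything from the question.

```python
import re
from html.parser import HTMLParser

# Naive pattern: assumes lowercase tags, href as the first attribute,
# a single space after "<a", and double quotes around the value.
NAIVE_LINK = re.compile(r'<a href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_with_regex(html):
    return NAIVE_LINK.findall(html)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


SAMPLE = '''
<a href="one.html">one</a>
<A class="nav" HREF='two.html'>two</A>
<a
   href=three.html>three</a>
<!-- <a href="old.html">old</a> -->
'''

print(links_with_regex(SAMPLE))   # ['one.html', 'old.html']
print(links_with_parser(SAMPLE))  # ['one.html', 'two.html', 'three.html']
```

The regex misses the uppercase tag, the single-quoted value, and the unquoted value, and it also picks up a link that is commented out. Each of those can be patched in the pattern, but only after you have seen the variation in the wild.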

Additionally, a regex is harder to adapt to changes in a page's contents than an XPath expression, and the XPath is (in my mind) easier to read, as it need not be concerned with syntactic odds and ends like tag openings and closings.
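For a feel of what that looks like, here is a small sketch using the limited XPath subset in Python's standard-library `xml.etree.ElementTree`. Note the assumptions: the page, the `menu` id, and the `menu_links` helper are hypothetical, and `ElementTree` only accepts well-formed XML, so for real-world HTML you would pair XPath with an actual HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html><body>
  <ul id="menu">
    <li><a href="/home">Home</a></li>
    <li><a href="/docs">Docs</a></li>
  </ul>
  <p>See <a href="/elsewhere">elsewhere</a>.</p>
</body></html>
"""


def menu_links(markup):
    """Return the href of every link inside the list with id="menu"."""
    root = ET.fromstring(markup)
    # The path names elements and attributes, not angle brackets or quotes.
    return [a.get("href") for a in root.findall(".//ul[@id='menu']/li/a")]


print(menu_links(PAGE))  # ['/home', '/docs']
```

If the page later moves the menu into another container or adds attributes to the links, the path expression keeps working, whereas a regex written against the raw text would likely need to be rewritten.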

So, instead of using the wrong tool for the job (a text-matching tool for a structured document), use the right tool for the job (an HTML parser for parsing HTML).

Borealid answered Nov 14 '22 22:11


I'm tired of hearing "HTML is not a regular language ...". Regular expressions (as implemented in today's languages) aren't strictly regular either.

The simple answer is:

A regular expression is not a parser. It describes a pattern and will match that pattern, but it has no idea about the document structure. You can't parse anything with a single regex. Of course, regexes can be part of a parser; I don't know for certain, but I assume nearly every parser uses regexes internally to find certain sub-patterns.

If you can build that pattern for the stuff you want to find inside HTML, fine, use it. But very often you won't be able to create such a pattern, because it's practically impossible to cover all the corner cases, or dependencies such as "find all links, but only if they are green and not pink".
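The "green but not pink" kind of dependency might look like this with a parser. This is only a sketch: it assumes the colour is expressed through CSS class names, which is my invention, and `GreenLinkCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class GreenLinkCollector(HTMLParser):
    """Collects hrefs of links whose class list has 'green' but not 'pink'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Attribute order and extra classes don't matter here.
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and "green" in classes and "pink" not in classes:
            self.links.append(href)


def green_links(html):
    collector = GreenLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


SAMPLE = (
    '<a class="green" href="/a">a</a>'
    '<a href="/b" class="big green">b</a>'
    '<a class="pink" href="/c">c</a>'
    '<a class="green pink" href="/d">d</a>'
    '<a href="/e">e</a>'
)

print(green_links(SAMPLE))  # ['/a', '/b']
```

The condition is an ordinary `if` on parsed attributes. Expressing the same thing in one regex means handling every attribute order and every combination of class names inside the pattern itself.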

In most cases it's a lot easier to use a parser that understands the structure of your document and also accepts a lot of "broken" HTML. It makes it easy to access all links, or all table cells of a certain table, or ...
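Here is a minimal sketch of the "all table cells of a certain table" case on sloppy markup, again with Python's standard-library `html.parser`. The `TableCells` class, the table ids, and the sample are hypothetical, and the sketch ignores nested tables. Keep in mind that `html.parser` is an event-based tokenizer, not a tree builder, so a full DOM library would do more of this work for you.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collects cell text from the table with the given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False   # inside the table we care about?
        self.in_cell = False    # inside a <td>/<th> of that table?
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.in_table = dict(attrs).get("id") == self.table_id
        elif tag in ("td", "th") and self.in_table:
            # A new cell starts even if the previous one was never closed.
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
            self.in_cell = False
        elif tag in ("td", "th", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def table_cells(html, table_id):
    parser = TableCells(table_id)
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Sloppy markup: mixed case, unquoted attributes, missing </td>, </tr>, </p>.
SAMPLE = (
    "<TABLE id=nav><TR><TD>Home</TABLE>"
    "<p>Prices today"
    "<table id=prices>"
    "<tr><td>apple<td>1.20"
    "<tr><td>pear<td>0.95"
    "</table>"
)

print(table_cells(SAMPLE, "prices"))  # ['apple', '1.20', 'pear', '0.95']
```

None of the missing closing tags needed special handling in the extraction logic, which is the point: the parser deals with the syntax, and your code deals with the structure.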

stema answered Nov 14 '22 21:11