It seems like every question on stackoverflow where the asker is using regex to grab some information from HTML will inevitably have an "answer" that says not to use regex to parse HTML.
Why not? I'm aware that there are quote-unquote "real" HTML parsers out there like Beautiful Soup, and I'm sure they're powerful and useful, but if you're just doing something simple, quick, or dirty, then why bother using something so complicated when a few regex statements will work just fine?
Moreover, is there just something fundamental that I don't understand about regex that makes them a bad choice for parsing in general?
Regex isn't suited to parsing HTML because HTML isn't a regular language. Regex is generally not the tool to reach for when parsing source code either; there are better tools for producing tokenized output. I would likewise avoid parsing a URL's path and query parameters with regex.
The browser parses HTML into a DOM tree. HTML parsing involves two stages: tokenization and tree construction. HTML tokens include start and end tags, as well as attribute names and values. If the document is well-formed, parsing it is straightforward and faster, because no error recovery is needed.
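As a rough illustration of the tokenization stage, here is a minimal sketch using Python's standard-library html.parser (it shows the event stream a tokenizer produces, not how browsers are actually implemented; the class name is made up):

from html.parser import HTMLParser

class TokenDumper(HTMLParser):
    # Each handler corresponds to one token type the tokenizer emits.
    def handle_starttag(self, tag, attrs):
        print("start tag:", tag, attrs)  # attribute names/values arrive pre-parsed
    def handle_endtag(self, tag):
        print("end tag:", tag)
    def handle_data(self, data):
        if data.strip():
            print("text:", data.strip())

TokenDumper().feed('<p class="x">hello <b>world</b></p>')
# start tag: p [('class', 'x')]
# text: hello
# start tag: b []
# text: world
# end tag: b
# end tag: p

Tree construction then consumes these tokens to build the DOM; that is the part a regex cannot do.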
XML is not a regular language (that's a technical term), so you will never be able to parse it correctly using a regular expression.
No, it is not possible in principle: any language whose strings contain arbitrarily nested, balanced delimiters is not regular. The regular expression language itself is an example: it allows parenthesized expressions representing capturing and non-capturing groups, lookarounds, etc., where the parentheses must be balanced. HTML's nested opening and closing tags pose exactly the same problem.
Parsing HTML in its entirety is not possible with regular expressions, since it depends on matching each opening tag with its corresponding closing tag, which is not possible with regexps.
Regular expressions can only match regular languages, but HTML is a context-free language and not a regular one (as @StefanPochmann pointed out, regular languages are also context-free, so context-free doesn't necessarily mean not regular). The only thing you can do with regexps on HTML is apply heuristics, and those will not work in every case: for any given regular expression, it should be possible to construct an HTML file that it matches wrongly.
For quick'n'dirty work a regexp will do fine. But the fundamental thing to know is that it is impossible to construct a regexp that will correctly parse HTML.
The reason is that regexps can't handle arbitrarily nested expressions. See Can regular expressions be used to match nested patterns?
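To make the nesting problem concrete, here is a small sketch in Python (the pattern and helper are illustrative, not from any of the answers): a regex hard-wired for one level of <div> nesting fails as soon as the input goes one level deeper, while a trivial counting scan handles any depth.

import re

# Matches a div that contains no nested divs -- depth 1 only.
ONE_LEVEL = re.compile(r"<div>(?:(?!</?div>).)*</div>", re.S)

shallow = "<div>hello</div>"
deep = "<div><div>hello</div></div>"

print(bool(ONE_LEVEL.fullmatch(shallow)))  # True
print(bool(ONE_LEVEL.fullmatch(deep)))     # False: a regex cannot count depth

def balanced_divs(html):
    # Counting with a stack (here just a counter) is exactly the
    # capability a finite automaton lacks.
    depth = 0
    for tag in re.findall(r"</?div>", html):
        depth += 1 if tag == "<div>" else -1
        if depth < 0:
            return False
    return depth == 0

print(balanced_divs(deep))  # True, at any nesting depth

You can always extend the regex to cover one more level, but whatever fixed depth it covers, depth + 1 breaks it.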
(From http://htmlparsing.com/regexes)
Say you've got a file of HTML where you're trying to extract URLs from <img> tags.
<img src="http://example.com/whatever.jpg">
So you write a regex like this in Perl:
if ( $html =~ /<img src="(.+)"/ ) {
    $url = $1;
}
In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:
<img src='http://example.com/whatever.jpg'>
or
<img src=http://example.com/whatever.jpg>
or
<img border=0 src="http://example.com/whatever.jpg">
or
<img
src="http://example.com/whatever.jpg">
or you start getting false positives from
<!-- // commented out
<img src="http://example.com/outdated.png">
-->
It looks so simple, and it might be simple for a single, unchanging file, but for anything that you're going to be doing on arbitrary HTML data, regexes are just a recipe for future heartache.
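For contrast, a real parser absorbs all of those variations for free. Here is a hedged sketch using Beautiful Soup, which the question already mentions (the markup is the set of examples from above):

from bs4 import BeautifulSoup

html = '''
<img src="http://example.com/whatever.jpg">
<img src='http://example.com/whatever.jpg'>
<img src=http://example.com/whatever.jpg>
<img border=0 src="http://example.com/whatever.jpg">
<img
    src="http://example.com/whatever.jpg">
<!-- // commented out
<img src="http://example.com/outdated.png">
-->
'''

soup = BeautifulSoup(html, "html.parser")
# Quoting style, attribute order, line breaks inside tags, and comments
# are all handled by the parser, not by us.
urls = [img["src"] for img in soup.find_all("img")]
print(urls)  # five matches; the commented-out image is correctly skipped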
Two quick reasons: writing a regex that can stand up to malicious input is hard, harder than using a tested tool; and writing a regex that can work with the inevitably ugly markup you have no control over is also hard, harder than using a tested tool.
Regarding the suitability of regexes for parsing in general: they aren't suitable. Have you ever seen the sorts of regexes you would need to parse most languages?
As far as parsing goes, regular expressions can be useful in the "lexical analysis" (lexer) stage, where the input is broken down into tokens. They are less useful in the actual "build a parse tree" stage.
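A hedged sketch of that division of labor in Python (the token names and patterns are illustrative): the regex chops the input into a flat token stream, and pairing the tags up into a tree is left to a separate parsing stage.

import re

# Regexes are a fine fit for the lexer stage: classify flat tokens.
TOKEN = re.compile(r"""
    (?P<COMMENT> <!--.*?--> )
  | (?P<TAG>     </?\w[^>]*> )
  | (?P<TEXT>    [^<]+ )
""", re.VERBOSE | re.DOTALL)

for m in TOKEN.finditer('<p>hi <b>there</b></p>'):
    print(m.lastgroup, repr(m.group()))
# TAG '<p>'
# TEXT 'hi '
# TAG '<b>'
# TEXT 'there'
# TAG '</b>'
# TAG '</p>'

Matching each <b> with its </b> to build a parse tree is the part regular expressions cannot do on their own.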
I'd expect an HTML parser to accept only well-formed HTML, and that requires capabilities outside what a regular expression can do: regular expressions cannot "count" and make sure that a given number of opening elements is balanced by the same number of closing elements.
Because there are many ways to "screw up" HTML that browsers will nevertheless treat in a rather liberal way, it would take quite some effort to reproduce that liberal behaviour with regular expressions and cover all the cases. Your regex will therefore inevitably fail on some special cases, and that could introduce serious security gaps in your system.
The problem is that most users who ask a question involving HTML and regex do so because they can't find a regex of their own that works. Then one has to consider whether everything would be easier with a DOM or SAX parser or something similar; those are optimized and built precisely for working with XML-like document structures.
Sure, there are problems that can be solved easily with regular expressions. But the emphasis is on easily.
If you just want to find all URLs that look like http://.../, you're fine with regexps. But if you want to find all URLs that are inside an <a> element with the class 'mylink', you are probably better off using an appropriate parser.
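For example, here is a quick sketch of that second task with Beautiful Soup (the class name 'mylink' comes from the answer above; the sample markup is made up):

from bs4 import BeautifulSoup

html = '''
<a class="mylink" href="http://example.com/a">one</a>
<a href="http://example.com/b">not this one</a>
<a class="mylink other" href="http://example.com/c">two</a>
'''

soup = BeautifulSoup(html, "html.parser")
# select() takes a CSS selector; the class test still matches when the
# element carries several classes.
urls = [a["href"] for a in soup.select("a.mylink")]
print(urls)  # ['http://example.com/a', 'http://example.com/c']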
Regular expressions were not designed to handle a nested tag structure, and it is at best complicated (at worst, impossible) to handle all the possible edge cases you get with real HTML.
I believe that the answer lies in computation theory. For a language to be parsed using regex, it must by definition be "regular". HTML is not a regular language, as it does not meet a number of the criteria for a regular language (much of this has to do with the many levels of nesting inherent in HTML code). If you are interested in the theory of computation, I would recommend this book.