
Why is it such a bad idea to parse XML with regex? [closed]

I was just reviewing a previous post of mine and noticed a number of people suggesting that I not use regex to parse XML. In that case the XML was relatively simple, and regex didn't pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I'm curious how this might pose a problem in other cases. Is this just a "don't reinvent the wheel" type of issue?

yatakaka asked Dec 20 '11 14:12



1 Answer

The real trouble is nested tags, which are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET (as balancing groups) and, in the form of recursive patterns, in a few other flavors such as PCRE and Perl. But even with the power of balanced matching, an ill-placed comment could throw off the regular expression.
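To illustrate the nesting problem, here is a minimal Python sketch. The `NESTED` string and the two patterns are made up for this illustration; they are not from the original post. A lazy pattern stops at the first closing tag, a greedy one runs on to the last, and neither returns the content of the outer element. The standard library's `xml.etree.ElementTree` is used for comparison, with a hypothetical `<root>` wrapper added only so the fragment has a single root element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: a <b> nested inside another <b>, followed by a sibling <b>.
NESTED = "<b>outer <b>inner</b> tail</b><b>sibling</b>"

# Lazy match: stops at the FIRST </b>, which belongs to the inner element.
lazy = re.search(r"<b>(.*?)</b>", NESTED).group(1)

# Greedy match: runs to the LAST </b>, swallowing the sibling element too.
greedy = re.search(r"<b>(.*)</b>", NESTED).group(1)

# A parser tracks nesting depth, so each element gets its own content.
root = ET.fromstring("<root>" + NESTED + "</root>")
outer = root[0]    # the first top-level <b>
inner = outer[0]   # the <b> nested inside it

print(repr(lazy))    # too short: cut off inside the outer element
print(repr(greedy))  # too long: includes the sibling
print(repr(outer.text), repr(inner.text), repr(inner.tail))
```

Neither regex result equals the outer element's real content, `outer <b>inner</b> tail`, while the parsed tree keeps the outer text, the inner element, and the trailing text separate.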

For example, this is a tricky one to parse...

    <div>
        <div id="parse-this">
            <!-- oops</div> -->
            try to get this value with regex
        </div>
    </div>

You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
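As a rough sketch of the difference, the Python below runs a naive regex and a parser against the example markup above. The function names `naive_extract` and `parser_extract` and the specific pattern are assumptions made for this illustration, and `xml.etree.ElementTree` is just one of many parsers that would work here (the snippet happens to be well-formed XML).

```python
import re
import xml.etree.ElementTree as ET

# The tricky markup from the answer, with the commented-out </div>.
DOC = """\
<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>
"""

def naive_extract(doc):
    """Hypothetical regex attempt: grab everything up to the first </div>."""
    match = re.search(r'<div id="parse-this">(.*?)</div>', doc, re.DOTALL)
    return match.group(1).strip() if match else None

def parser_extract(doc):
    """Let an XML parser find the element, then collect its text content."""
    root = ET.fromstring(doc)
    target = root.find(".//div[@id='parse-this']")
    # itertext() yields the element's text without the comment's contents.
    return "".join(target.itertext()).strip()

print(naive_extract(DOC))   # fooled by the </div> inside the comment
print(parser_extract(DOC))  # the actual text content
```

The regex stops at the `</div>` inside the comment and returns the fragment `<!-- oops`, while the parser knows the comment is not markup and returns the intended text. Patching the pattern to skip comments is possible, but CDATA sections, attribute values containing `>`, and nested divs would each need yet another patch.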

Steve Wortham answered Oct 17 '22 11:10