I was just reviewing a previous post I made and noticed a number of people suggesting that I shouldn't use regex to parse XML. In that case the XML was relatively simple, and regex didn't pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I'm curious how this might pose a problem in other cases. Is this just a 'don't reinvent the wheel' type of issue?
Regular expressions are not sophisticated enough to understand the constructs employed by HTML. HTML is not a regular language: elements can nest to arbitrary depth, and matching each opening tag to its own closing tag requires a kind of counting that regular expressions cannot do. Regex queries are therefore not equipped to break HTML down into its meaningful parts.
No, it is not possible: the regular expression language itself allows parenthesized expressions (capturing and non-capturing groups, lookarounds, etc.) whose parentheses must be balanced, and balanced nesting is exactly what a regular language cannot describe.
The most common cause is encoding errors. There are several basic approaches to solving this: escaping problematic characters (< becomes &lt;, & becomes &amp;, etc.), escaping entire blocks of text with CDATA sections, or putting an encoding declaration at the start of the feed.
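The first two approaches can be sketched in a few lines of Python. This is only an illustration: the sample string and the title element are made up, and the CDATA variant assumes the text never contains the sequence ]]>.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical feed text containing characters that are special in XML.
raw = "Fish & chips < 5 dollars"

# Approach 1: escape the problematic characters.
escaped = escape(raw)
escaped_doc = "<title>" + escaped + "</title>"

# Approach 2: wrap the whole block in a CDATA section
# (only safe if raw does not itself contain "]]>").
cdata_doc = "<title><![CDATA[" + raw + "]]></title>"

# Both documents parse, and both give back the original text.
from_escaped = ET.fromstring(escaped_doc).text
from_cdata = ET.fromstring(cdata_doc).text
print(escaped)
print(from_escaped == raw, from_cdata == raw)
```

Escaping is usually the simpler choice for short values; CDATA tends to be more convenient for large blocks of text that contain a lot of markup-like characters.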
The real trouble is nested tags, which are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET (balancing groups) and a few other flavors that support recursive patterns, such as PCRE and Perl. And even with the power of balanced matching, an ill-placed comment could throw off the regular expression.
For example, this is a tricky one to parse...
<div> <div id="parse-this"> <!-- oops</div> --> try to get this value with regex </div> </div>
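To see the failure concretely, here is a minimal Python sketch of the obvious first attempt. The pattern and variable names are illustrative, not something from the original post.

```python
import re

# The tricky snippet, written as a single string.
html = '<div> <div id="parse-this"> <!-- oops</div> --> try to get this value with regex </div> </div>'

# Naive attempt: capture everything up to the first closing </div>.
naive = re.search(r'<div id="parse-this">(.*?)</div>', html)
captured = naive.group(1)

# The match stops at the </div> inside the comment,
# so the text we actually wanted is never captured.
print(repr(captured))
```

Switching to a greedy .* does not help either: it would then run on to the last closing tag and swallow the comment along with the inner </div>.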
You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
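As a rough sketch of the parser route, Python's standard xml.etree.ElementTree handles the snippet without any special casing. This assumes the input is well-formed XML, which this particular snippet happens to be; messy real-world HTML would need a dedicated HTML parser instead. Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The same tricky snippet, written as a single string.
snippet = '<div> <div id="parse-this"> <!-- oops</div> --> try to get this value with regex </div> </div>'

# The parser recognizes the comment as a comment, so the </div>
# inside it is never mistaken for a closing tag.
root = ET.fromstring(snippet)

# Find the nested div by its id attribute.
target = root.find(".//div[@id='parse-this']")

# Collect the element's text (comments are skipped by default).
value = "".join(target.itertext()).strip()
print(value)
```

The lookup is expressed in terms of elements and attributes rather than character patterns, which is why comments, whitespace, and nesting stop being edge cases.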