Why is it such a bad idea to parse XML with regex? [closed]

I was just reviewing a previous post of mine and noticed a number of people suggesting that I not use regex to parse XML. In that case the XML was relatively simple, and regex didn't pose any problems. I was also parsing a number of other code formats, so for the sake of uniformity it made sense. But I'm curious how this might pose a problem in other cases. Is this just a 'don't reinvent the wheel' type of issue?

439

asked Dec 20 '11 14:12

yatakaka

1 Answer

The real trouble is nested tags, which are very difficult to handle with regular expressions. It's possible with balanced matching, but that's only available in .NET and maybe a couple of other flavors. And even with the power of balanced matching, an ill-placed comment could potentially throw off the regular expression.
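To make the nesting problem concrete, here is a minimal sketch using Python's `re` module (my choice for illustration, the point is not specific to Python). The sample strings and variable names are made up for this example. A lazy pattern stops at the first closing tag, and a greedy one runs to the last, so neither one tracks which closing tag belongs to which opening tag.

```python
import re

# Hypothetical inputs, invented for this illustration.
NESTED = "<div>outer <div>inner</div> tail</div>"
SIBLINGS = "<div>first</div><div>second</div>"

lazy = re.compile(r"<div>(.*?)</div>", re.DOTALL)
greedy = re.compile(r"<div>(.*)</div>", re.DOTALL)

# Lazy matching pairs the OUTER opening tag with the INNER closing tag.
lazy_on_nested = lazy.search(NESTED).group(1)

# Greedy matching swallows the boundary between two separate elements.
greedy_on_siblings = greedy.search(SIBLINGS).group(1)

print(repr(lazy_on_nested))      # 'outer <div>inner'
print(repr(greedy_on_siblings))  # 'first</div><div>second'
```

Neither quantifier is right in general, because the correct closing tag depends on how many tags have been opened in between, and a plain regular expression cannot count that.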

For example, this is a tricky one to parse...

<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>

You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there's no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
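As a rough sketch of the difference, the snippet below runs the markup from the example above through a naive regex and through Python's standard-library `xml.etree.ElementTree` parser. Python and the two helper function names are my own choices for illustration, not something from the original answer, and a real HTML page that is not well-formed XML would need an HTML parser instead.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>"""


def extract_with_regex(doc):
    """Naive approach: grab everything up to the first closing tag."""
    match = re.search(r'<div id="parse-this">(.*?)</div>', doc, re.DOTALL)
    return match.group(1).strip() if match else None


def extract_with_parser(doc):
    """Let a real parser decide what is a comment and what is text."""
    root = ET.fromstring(doc)
    target = root.find(".//div[@id='parse-this']")
    if target is None:
        return None
    # itertext() yields the element's text nodes; comment bodies are not included.
    return "".join(target.itertext()).strip()


regex_result = extract_with_regex(DOC)
parser_result = extract_with_parser(DOC)

print(repr(regex_result))   # '<!-- oops'
print(repr(parser_result))  # 'try to get this value with regex'
```

The regex is fooled by the `</div>` inside the comment, while the parser already knows that comment content is not markup. You could extend the pattern to skip comments, but then CDATA sections, attribute values containing `>`, and similar cases are waiting behind it.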

180

answered Oct 17 '22 11:10

Steve Wortham

Tags: regex, xml, xml-parsing
