This article argues that regular expressions cannot match nested structures because regexes are equivalent to finite automata.
The author then offers a list of problems that, according to the answer, cannot be solved using regexes:
Since 2 and 3 can conceivably contain brackets, that nesting is unsolvable for regexes. But why is it impossible to match an XML element? (He didn't provide examples.)
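To make the nesting argument concrete, here is a minimal sketch using Python's standard re module, which has no recursion construct. The pattern name TWO_LEVELS and the helper is_balanced are made up for illustration. The pattern hard-codes two levels of parentheses and fails as soon as the input goes one level deeper, while a plain counter handles any depth.

```python
import re

# Hypothetical pattern: parentheses nested at most two levels deep.
# Every extra level would have to be spelled out by hand in the pattern.
TWO_LEVELS = re.compile(r"\((?:[^()]|\([^()]*\))*\)")


def is_balanced(text: str) -> bool:
    """Check parenthesis balance with a counter, which works at any depth."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with nothing open
                return False
    return depth == 0


if __name__ == "__main__":
    print(bool(TWO_LEVELS.fullmatch("(a(b)c)")))     # two levels: the pattern copes
    print(bool(TWO_LEVELS.fullmatch("(a(b(c))d)")))  # three levels: the pattern gives up
    print(is_balanced("(a(b(c))d)"))                 # the counter does not care about depth
```

The counter is the piece a finite automaton lacks: it needs memory that grows with the nesting depth.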
XML is not a regular language (that's a technical term) so you will never be able to parse it correctly using a regular expression.
HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.
XML Schema always implicitly anchors the entire regular expression. The regex must match the whole element for the element to be considered valid. If you have the pattern regexp, the XML Schema validator will apply it the same way that, say, Perl, Java or .NET would apply the pattern ^regexp$.
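The effect of implicit anchoring can be imitated in Python, as a rough sketch only: XML Schema has its own regex dialect, and the pattern and function names below are made up for illustration. re.fullmatch plays the role of the anchored match and re.search shows what an unanchored match would accept.

```python
import re

# Hypothetical facet: exactly three digits.
PATTERN = r"\d{3}"


def schema_style_match(pattern: str, value: str) -> bool:
    """Imitate implicit anchoring: the whole value must match the pattern."""
    return re.fullmatch(pattern, value) is not None


def unanchored_match(pattern: str, value: str) -> bool:
    """Plain search: a match anywhere inside the value is enough."""
    return re.search(pattern, value) is not None


if __name__ == "__main__":
    print(schema_style_match(PATTERN, "123"))    # whole value is three digits
    print(schema_style_match(PATTERN, "a1234"))  # extra characters, so rejected
    print(unanchored_match(PATTERN, "a1234"))    # accepted, because "123" occurs inside
```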
You can match a limited subset of HTML tags, if you know in advance the tags to be matched.
But you can't (reliably or nicely) parse arbitrary HTML. It is not a regular language.
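As a sketch of the "limited subset" case, assume you only care about a handful of tags that appear without attributes. The tag list and the function name find_known_tags are hypothetical.

```python
import re

# Hypothetical whitelist of tags known in advance, written without attributes.
KNOWN_TAGS = ("b", "i", "em")

# The backreference \1 forces the closing tag to repeat the opening tag's name.
TAG_RE = re.compile(
    r"<(%s)>(.*?)</\1>" % "|".join(KNOWN_TAGS),
    re.IGNORECASE | re.DOTALL,
)


def find_known_tags(html: str) -> list:
    """Return (tag, inner text) pairs for the whitelisted tags only."""
    return [(m.group(1).lower(), m.group(2)) for m in TAG_RE.finditer(html)]


if __name__ == "__main__":
    print(find_known_tags("<p>Some <b>bold</b> and <em>emphasised</em> text</p>"))
    # Nesting the same tag already breaks it: the match stops at the first </b>.
    print(find_known_tags("<b>a<b>b</b></b>"))
```

This holds up only as long as the input stays that simple. Attributes, comments or a tag nested inside itself are enough to give wrong results.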
How would you match markup like this with a regex?
<!--<d>>--<<--><div class='foo' id="bar" inline></div>
It's like making a wooden car. Sure you can try to do it, but why?
But then comes the part of parsing the XML. How would you extract a possibly unbounded set of attributes from an unbounded set of elements using a finite set of capture groups? It's just not possible due to the nature and structure of regex.
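The capture-group limit is easy to see in Python, where a repeated group keeps only its last repetition. As one possible alternative (not the only one), the sketch below feeds the snippet from the question to the standard library's html.parser. The class name AttributeCollector and the helper collect are made up for illustration.

```python
import re
from html.parser import HTMLParser

SNIPPET = """<!--<d>>--<<--><div class='foo' id="bar" inline></div>"""

# One capture group inside a repetition: only the last attribute name survives.
ATTR_LOOP = re.compile(r'<\w+(?:\s+(\w+)="[^"]*")*\s*>')


class AttributeCollector(HTMLParser):
    """Hypothetical helper that records every start tag with all its attributes."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        self.elements.append((tag, dict(attrs)))

    def handle_comment(self, data):
        self.comments.append(data)


def collect(markup: str) -> AttributeCollector:
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser


if __name__ == "__main__":
    match = ATTR_LOOP.match('<a x="1" y="2" z="3">')
    print(match.group(1))   # only the last attribute name is kept
    result = collect(SNIPPET)
    print(result.comments)  # the comment body, angle brackets included
    print(result.elements)  # the div with all three attributes
```

The parser keeps a growing list per element, which is exactly what a fixed number of groups cannot give you.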