
Look for Nested XML tag with Regex

This is my first post here, so I'm hoping to get some responses. I've read through a few similar posts, and the consensus is not to try parsing XML/HTML with regex, but what I'm asking seems easier than the cases in those posts, so I'm giving it a shot.

I'm trying to find all nested tags. Here is an example I want to catch: <a><a></a></a>

I don't want to catch <a></a><a></a>

So in plain English, I want to catch every <a> that follows another <a> without a </a> in between them. I also want to search through the entire string, so matching should continue even when it encounters a newline or line break.

Hoping to have this problem solved. Thanks all!

Gugg asked Jan 13 '23 16:01


2 Answers

I hope you are ready for parsing XML with regex.


First of all, let's define what XML tags would look like!

<tag_name␣(optional space (then anything that doesn't end with "/"))>(whatever)</␣(optional space)tag_name>
<tag_name␣(optional space)/>

To match one of these tags we can then use the following regex:

/<[^ \/>]++ ?\/>|<([^ \>]++) ?[^>]*+>.*?<\/ ?\1>/s

Obviously, no tags are going to nest within our second type of XML tag. So our two-level nested regex would then be:

/<([^ \>]++) ?[^>]*+>.*?(?:<([^ \>]++) ?[^>]*+>.*?<\/ ?\2>|<[^ \/>]++ ?\/>).*?<\/ ?\1>/s
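
For readers who want to try the two-level pattern outside a PCRE engine, here is a rough Python adaptation. It is a sketch under assumptions, not the pattern exactly as written above: the possessive quantifiers (`++`, `*+`) are swapped for ordinary greedy ones so that it runs on older versions of Python's built-in `re`, the `/s` flag becomes `re.DOTALL`, and the names `TWO_LEVEL` and `has_nested_tag` are made up for illustration. Dropping the possessive quantifiers can make it match more loosely on unusual input.

```python
import re

# Adaptation of the two-level pattern for Python's built-in `re` module.
# Assumption: possessive quantifiers are replaced by plain greedy ones,
# which may allow extra backtracking compared with the original.
TWO_LEVEL = re.compile(
    r"<([^ >]+) ?[^>]*>"                # outer opening tag, name in group 1
    r".*?"                              # anything before the inner tag
    r"(?:<([^ >]+) ?[^>]*>.*?</ ?\2>"   # inner open/close pair, name in group 2
    r"|<[^ />]+ ?/>)"                   # or an inner self-closing tag
    r".*?</ ?\1>",                      # closing tag that matches group 1
    re.DOTALL,                          # let '.' cross newlines, like /s
)


def has_nested_tag(text):
    """Return True if some tag in `text` contains another tag."""
    return TWO_LEVEL.search(text) is not None


if __name__ == "__main__":
    print(has_nested_tag("<a><a></a></a>"))   # nested case from the question
    print(has_nested_tag("<a></a><a></a>"))   # sibling case from the question
```

Note that this catches any tag nested in any other tag, not only an <a> inside an <a>.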

Now let's apply some recursion magic (Hopefully your regex engine supports recursion (and doesn't crash yet)):

/<([^ \>]++) ?[^>]*+>(.*?(?:<([^ \>]++) ?[^>]*+>(?:[^<]*+|(?2))<\/ ?\3>|<[^ \/>]++ ?\/>).*?)<\/ ?\1>/s

Done - The regex should do.

No seriously, try it out.
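
If your engine has no recursion (Python's built-in `re`, for example, does not support `(?2)`), one alternative is to use a regex only to tokenize tags and keep the nesting state in a stack. The sketch below is not the regex above; it is a different technique aimed at the same goal, and the names `TAG` and `find_same_name_nesting` are hypothetical. It assumes reasonably tidy markup and ignores comments, CDATA, and `>` characters inside attribute values.

```python
import re

# Tokenizer for opening, closing, and self-closing tags.
# Group 1: "/" for a closing tag, group 2: tag name, group 3: "/" if self-closing.
TAG = re.compile(r"<(/?)([^\s/>]+)[^>]*?(/?)>")


def find_same_name_nesting(text):
    """Return (name, offset) for each opening tag that appears while an
    element of the same name is still open."""
    open_names = []   # stack of currently open tag names
    hits = []
    for m in TAG.finditer(text):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if closing:
            # Pop back to the matching opener; ignore stray closing tags.
            if name in open_names:
                while open_names.pop() != name:
                    pass
        else:
            if name in open_names:
                hits.append((name, m.start()))
            if not self_closing:
                open_names.append(name)
    return hits


if __name__ == "__main__":
    print(find_same_name_nesting("<a><a></a></a>"))
    print(find_same_name_nesting("<a></a><a></a>"))
```

Because the tokenizer regex never uses `.`, newlines between tags need no special flag.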

I stole an XML file fragment from the w3schools XML tutorial and tried it with my regex. I also copied a Maven project .xml from aliteralmind's question and tried that as well. Works best with heavily nested elements.


Cheers.

Unihedron answered Jan 19 '23 04:01


If you want a 100% correct solution, for example one that works with arbitrary content in comments and CDATA sections and in internal/external entities, and with author-chosen namespace prefixes, then it can't be done with regular expressions.

And since a 100% correct solution is very easy to achieve with XSLT, I think you are using the wrong technology.

No doubt you can achieve an acceptably high hit rate with regular expressions if you're prepared to put enough work in, but the details depend on aspects of the specification that you haven't made clear: for example, what you want to do with the nested elements that you find, and whether you want to locate elements nested 3-deep or 4-deep as well as those nested 2-deep.
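
To illustrate the parser-based route, here is a minimal sketch in Python rather than XSLT (a swapped-in technology, chosen only to keep the example runnable with the standard library). It uses `xml.etree.ElementTree` and reports every element that has an ancestor with the same tag, together with how deep the same-tag nesting is, so 3-deep and 4-deep cases are covered too. The function name `nested_same_tag` is hypothetical. It assumes the input is a well-formed document; with namespaces, ElementTree reports tags as `{uri}local`, so author-chosen prefixes do not affect the comparison.

```python
import xml.etree.ElementTree as ET


def nested_same_tag(xml_text):
    """Yield (tag, depth) for every element that has an ancestor with the
    same tag. depth counts same-tag ancestors plus the element itself."""
    root = ET.fromstring(xml_text)

    def walk(elem, counts):
        seen = counts.get(elem.tag, 0) + 1
        if seen > 1:
            yield elem.tag, seen
        # Copy so that siblings do not see each other's counts.
        child_counts = dict(counts)
        child_counts[elem.tag] = seen
        for child in elem:
            yield from walk(child, child_counts)

    yield from walk(root, {})


if __name__ == "__main__":
    print(list(nested_same_tag("<r><a><a/></a><a/><a/></r>")))
```

Comments and CDATA sections are handled by the parser, so an `<a>` that only appears inside them is not reported.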

Michael Kay answered Jan 19 '23 03:01