
Best way to transform a custom XML-like syntax

Using Python.

I have an XML-like tag syntax in which the tags have no attributes: <a> is valid, but <a value='t'> is not. Tags close in the usual way, with </a>.

Here is my question. I have something that looks like this:

<al>
1. test
2. test2
 test with new line
3.  test3
<al>
    1. test 4
    <al>
        2. test 5
        3. test 6
        4. test 7
    </al>
</al>
4. test 8
</al>

And I want to transform it into:

<al>
<li>test</li>
<li> test2</li>
<li> test with new line</li>
<li>  test3
<al>
    <li> test 4
    <al>
        <li> test 5</li>
        <li> test 6</li>
        <li> test 7</li>
    </al>
    </li>
</al>
</li>
<li> test 8</li>
</al>

I'm not really looking for a complete solution, but rather a push in the right direction. I'm just wondering how folks here would approach the problem: regex alone? Writing a full custom parser for the attribute-less tag syntax? Hacking up an existing XML parser? Something else?

Thanks in advance

Peach Passion asked Jul 15 '11 18:07


2 Answers

I'd recommend starting with the following:

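from xml.dom.minidom import parse, parseString

# parse() takes a filename or file object; parseString() takes the text itself
xml = parse(...)
# every <al> element in the document, nested ones included
l = xml.getElementsByTagName('al')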

Then traverse all the elements in l, examining their text child nodes (and any nested <al> nodes, recursively).

You may start playing with this right away in the Python console.

It is easy to remove the text nodes, split each text chunk with chunk.split('\n'), and add <li> nodes back as needed.

After modifying all the <al> nodes, you can just call xml.toxml() to get the resulting XML as text.
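
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the input is well-formed enough for minidom to parse, every non-blank line becomes its own <li>, item numbers look like "1. ", and a nested <al> belongs inside the item just before it. The names add_list_items and NUMBER_PREFIX are made up for this example.

```python
import re
from xml.dom.minidom import parseString

# Assumed item prefix such as "1. "; adjust to the real numbering format.
NUMBER_PREFIX = re.compile(r'^\s*\d+\.\s*')


def add_list_items(xml_text):
    """Wrap each non-blank text line inside every <al> in an <li> element."""
    doc = parseString(xml_text)
    for al in list(doc.getElementsByTagName('al')):
        last_li = None
        # Copy the child list, because the loop body modifies it.
        for node in list(al.childNodes):
            if node.nodeType == node.TEXT_NODE:
                for line in node.data.split('\n'):
                    if not line.strip():
                        continue  # skip blank or whitespace-only lines
                    li = doc.createElement('li')
                    text = NUMBER_PREFIX.sub('', line).strip()
                    li.appendChild(doc.createTextNode(text))
                    al.insertBefore(li, node)
                    last_li = li
                al.removeChild(node)  # the original text is now in <li>s
            elif node.nodeName == 'al' and last_li is not None:
                # appendChild moves the nested list into the preceding item.
                last_li.appendChild(node)
    return doc.documentElement.toxml()


if __name__ == '__main__':
    print(add_list_items('<al>\n1. test\n<al>\n2. test 5\n</al>\n3. test 8\n</al>'))
```

The original whitespace is discarded here, so the result comes out on one line. Something like toprettyxml() could be used instead of toxml() if indentation matters.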

Note that the element objects you get from this are linked back to the original xml document object, so do not delete the xml object in the process.

I personally consider this approach more straightforward and easier to debug than wrestling with multiline regexps.

spacediver answered Sep 23 '22 21:09


The way you've described your syntax, it is "XML without attributes". If that's so, it's still XML, so you can use XML tools such as XSLT and XQuery.

If you allow things that aren't allowed in XML, on the other hand, my approach would be to write a parser that handles your non-XML format and delivers XML-compatible SAX events. Then you'll be able to use any XML technology just by plugging in your parser in place of the regular XML parser.
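
As a rough sketch of that second approach in Python: a small tokenizer can call the standard SAX ContentHandler methods directly, so anything that consumes SAX events can sit behind it. Here the consumer is the standard library's XMLGenerator, which writes the events back out as XML. The tokenizer rules are assumptions made for illustration (tags are a bare name in angle brackets, and a stray "<" counts as text); emit_sax_events, to_xml and TOKEN are hypothetical names.

```python
import io
import re
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# A tag such as <al> or </al>, a run of text, or a stray "<" treated as text.
TOKEN = re.compile(r'<(/?)([A-Za-z][\w-]*)>|([^<]+)|(<)')


def emit_sax_events(text, handler):
    """Tokenize attribute-less tags and drive a SAX ContentHandler."""
    open_tags = []
    handler.startDocument()
    for match in TOKEN.finditer(text):
        closing, name, chars, stray = match.groups()
        if name and not closing:
            open_tags.append(name)
            # The syntax has no attributes, so pass an empty attribute set.
            handler.startElement(name, AttributesImpl({}))
        elif name:
            if not open_tags or open_tags.pop() != name:
                raise ValueError(f'unexpected closing tag </{name}>')
            handler.endElement(name)
        else:
            handler.characters(chars or stray)
    if open_tags:
        raise ValueError(f'unclosed tag <{open_tags[-1]}>')
    handler.endDocument()


def to_xml(text):
    """Feed the events to XMLGenerator to get real, escaped XML text."""
    out = io.StringIO()
    emit_sax_events(text, XMLGenerator(out, encoding='utf-8'))
    return out.getvalue()


if __name__ == '__main__':
    print(to_xml('<al>1. a < b</al>'))
```

Because XMLGenerator escapes character data, input that would not be legal XML (such as the bare "<" above) comes out as valid XML, which can then be passed to ordinary XML tooling.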

Michael Kay answered Sep 22 '22 21:09