
PHP: parsing only namespaced xml

Tags: dom, php, parsing, xml

I'm trying to parse data like this:

<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
    <header>
        {someText}
        <div>
            <!-- some invalid xml code -->
            <aas>
            <nav class="main">
                <vin:show section="Menu" />
            </nav>
        </div>
    </header>
</vin:layout>

How can I parse data like this in PHP?

I tried DOM, but it does not work because of the malformed XML inside the root element. Can I tell the parser that everything without the vin namespace is text?

Pete Kirkham asked Dec 28 '12 23:12



1 Answer

I would probably throw a kind of tag-soup parser at it: something that can read your format, which, apart from those deficiencies, looks pretty well written. Nothing in the text stands in the way of a simple scanner based on regular expressions. I called mine Tagsoup, with just the four node types you have: Starttag, Endtag, Text and Comment. For the tags you need to know their tag name and namespace prefix. The naming is similar to XML/HTML only for convenience; in fact this is all "roll your own", so do not hold these terms to any standard.
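To make the idea of a regular-expression-based scanner concrete, here is a minimal sketch. It is not the code from the gist: the function name tokenizeTagsoup() and the array shape of the tokens are assumptions made up for illustration, and it assumes that attribute values never contain "<" or ">".

```php
<?php
// Hypothetical helper (not the gist's TagsoupIterator): splits the input into
// comment, end-tag, start-tag and text tokens with a single regular expression.
function tokenizeTagsoup(string $input): array
{
    $pattern = '~
        (?<comment><!--.*?-->)
      | (?<endtag></(?:(?<endprefix>[A-Za-z][\w-]*):)?(?<endname>[A-Za-z][\w-]*)\s*>)
      | (?<starttag><(?:(?<prefix>[A-Za-z][\w-]*):)?(?<name>[A-Za-z][\w-]*)[^<>]*>)
      | (?<text>[^<]+|<)
    ~xs';

    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $tokens = [];
    foreach ($matches as $m) {
        if (($m['comment'] ?? '') !== '') {
            $tokens[] = ['type' => 'comment', 'prefix' => null, 'name' => null, 'raw' => $m[0]];
        } elseif (($m['endtag'] ?? '') !== '') {
            // An empty prefix capture means the tag has no namespace prefix.
            $tokens[] = ['type' => 'endtag', 'prefix' => $m['endprefix'] ?: null, 'name' => $m['endname'], 'raw' => $m[0]];
        } elseif (($m['starttag'] ?? '') !== '') {
            $tokens[] = ['type' => 'starttag', 'prefix' => $m['prefix'] ?: null, 'name' => $m['name'], 'raw' => $m[0]];
        } else {
            // Everything else, including a stray "<", is plain text.
            $tokens[] = ['type' => 'text', 'prefix' => null, 'name' => null, 'raw' => $m[0]];
        }
    }

    return $tokens;
}
```

Because every character of the input ends up in exactly one token, concatenating the raw values gives back the original string, which is what makes offset-based replacement possible later on.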

Usage that escapes every tag (start or end) that does not have the namespace prefix could look like this ($string contains the data from your question):

$scanner = new TagsoupIterator($string);

$nsPrefix = 'vin';

foreach ($scanner as $node) {
    $isTag  = $node instanceof TagsoupTag;
    $isOfNs = $isTag && $node->getTagNsPrefix() === $nsPrefix;
    if ($isTag && !$isOfNs) {
        $node = strtr($node, ['&' => '&amp;', '<' => '&lt;']);
    }
    echo $node;
}

Output:

<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
    &lt;header>
        {someText}
        &lt;div>
            <!-- some invalid xml code -->
            &lt;aas>
            &lt;nav class="main">
                <vin:show section="Menu" />
            &lt;/nav>
        &lt;/div>
    &lt;/header>
</vin:layout>

Usage that extracts everything inside a certain tag of a namespace could look like this:

$scanner = new TagsoupIterator($string);
$parser  = new TagsoupForwardNavigator($scanner);

$startTagWithNsPrefix = function ($namespace) {

    return function (TagsoupNode $node) use ($namespace) {

        /** @var TagsoupTag $node */
        return $node->getType() === Tagsoup::NODETYPE_STARTTAG
            && $node->getTagNsPrefix() === $namespace;
    };
};

$start = $parser->nextCondition($startTagWithNsPrefix('vin'));
$tag   = $start->getTagName();
$parser->next();
echo $html = implode($parser->getUntilEndTag($tag));

Output:

<header>
    {someText}
    <div>
        <!-- some invalid xml code -->
        <aas>
        <nav class="main">
            <vin:show section="Menu" />
        </nav>
    </div>
</header>

The next part is to replace that part of the $string. As Tagsoup offers binary offsets and lengths, this is easy (I take a slightly dirty shortcut via SimpleXML):

$xml = substr($string, 0, $start->getEnd()) . substr($string, $parser->getOffset());
$doc = new SimpleXMLElement($xml);
$doc[0] = $html;
echo $doc->asXML();

Output:

<vin:layout xmlns:vin="http://www.example.com/vin" name="Page">
    &lt;header&gt;
        {someText}
        &lt;div&gt;
            &lt;!-- some invalid xml code --&gt;
            &lt;aas&gt;
            &lt;nav class="main"&gt;
                &lt;vin:show section="Menu" /&gt;
            &lt;/nav&gt;
        &lt;/div&gt;
    &lt;/header&gt;
</vin:layout>
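The escaping in this output only happens on serialization; the fragment itself is stored as plain text and can be read back unchanged. A minimal sketch of that round trip, using a shortened fragment and an already emptied root element as assumed input:

```php
<?php
// Assumed input: the emptied root element, as produced by the substr() step.
$xml  = '<vin:layout name="Page" xmlns:vin="http://www.example.com/vin"></vin:layout>';
$html = '<header><aas><vin:show section="Menu" /></header>';

$doc    = new SimpleXMLElement($xml);
$doc[0] = $html;               // stored as text; the markup is escaped on output

$serialized = $doc->asXML();   // contains &lt;header&gt; and so on
$roundTrip  = (string) $doc;   // the original, unescaped fragment

echo $roundTrip, "\n";
```

Casting the element to a string is therefore enough to get the malformed inner markup back for further processing.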

Depending on your concrete needs, the implementation would need to change. For example, this one does not support nesting a tag inside a tag of the same name. It does not fail on such input, but it does not handle it either. I have no idea whether you have that case; if so, you would need to add an open/close counter. The navigator class could easily be extended for that, even to offer two kinds of end-tag finding methods.
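A rough sketch of such an open/close counter follows. It does not use the gist's navigator class; the function findMatchingEndTag() and the token arrays (with 'type', a qualified 'name' such as vin:box, and 'raw') are hypothetical and only illustrate the counting.

```php
<?php
// Hypothetical token shape:
// ['type' => 'starttag'|'endtag'|'text'|'comment', 'name' => ?string, 'raw' => string]
function findMatchingEndTag(array $tokens, int $startIndex): ?int
{
    $name = $tokens[$startIndex]['name'];

    // A self-closing start tag is its own end.
    if (substr($tokens[$startIndex]['raw'], -2) === '/>') {
        return $startIndex;
    }

    $depth = 0;
    for ($i = $startIndex, $n = count($tokens); $i < $n; $i++) {
        $token = $tokens[$i];
        if ($token['name'] !== $name) {
            continue; // other tags, text and comments do not affect the depth
        }
        if ($token['type'] === 'starttag') {
            if (substr($token['raw'], -2) !== '/>') {
                $depth++; // one more open tag of the same name
            }
        } elseif ($token['type'] === 'endtag') {
            $depth--;
            if ($depth === 0) {
                return $i; // this end tag closes the tag we started from
            }
        }
    }

    return null; // no matching end tag found
}
```

The counter only tracks tags with the same name as the starting tag, so unbalanced foreign markup in between (like the stray aas tag in the question) does not disturb it.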

The examples given here use the Tagsoup classes, which you can see in this gist: https://gist.github.com/4415105

hakre answered Oct 15 '22 15:10
