I'm trying to parse data like this:
<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
</vin:layout>
How can I parse data like this in PHP?
I tried DOM, but it does not work because of the malformed XML inside the root element. Can I tell the parser that everything without the vin namespace is text?
I would probably throw some sort of tag soup parser at it: something that can read your format, which apart from those deficiencies looks pretty well written. Nothing in the text stands in the way of a simple scanner based on regular expressions. I called mine Tagsoup, with just the four node types you have: Starttag, Endtag, Text and Comment. For tags you need to know their tag name and namespace prefix. It is only named similarly to XML/HTML for convenience; in fact this is all "roll your own", so do not hold these terms to any standard.
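To make the idea of a regular-expression scanner concrete, here is a minimal sketch of one. It is not the Tagsoup code from the gist: the function name scanTagsoup and the token array shape (type, prefix, name, source, offset) are assumptions made up for illustration, and the pattern deliberately ignores edge cases such as ">" inside attribute values.

```php
<?php
// Hypothetical sketch, not the Tagsoup classes from the gist.
// The token array shape (type, prefix, name, source, offset) is made up here.
function scanTagsoup(string $input): array
{
    $name = '[A-Za-z_][\w.-]*';
    // Either a comment, or "<", optional "/", optional "prefix:", a name,
    // optional attribute text and ">". Attribute values containing "<" or ">"
    // are not supported by this simple pattern.
    $pattern = "~(?<comment><!--.*?-->)"
        . "|<(?<close>/?)(?:(?<prefix>$name):)?(?<name>$name)(?:\\s[^<>]*)?>~s";

    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $tokens = [];
    $pos = 0;
    // Adds a text token for the byte range [$from, $to) if it is not empty.
    $text = function (int $from, int $to) use ($input, &$tokens) {
        if ($to > $from) {
            $tokens[] = ['type' => 'text', 'prefix' => null, 'name' => null,
                'source' => substr($input, $from, $to - $from), 'offset' => $from];
        }
    };
    foreach ($matches as $m) {
        [$source, $offset] = $m[0];
        $text($pos, $offset); // anything between two matches is text
        if (($m['comment'][0] ?? '') !== '') {
            $tokens[] = ['type' => 'comment', 'prefix' => null, 'name' => null,
                'source' => $source, 'offset' => $offset];
        } else {
            // Self-closing tags such as <vin:show /> are reported as start tags.
            $tokens[] = [
                'type'   => $m['close'][0] === '/' ? 'endtag' : 'starttag',
                'prefix' => $m['prefix'][0] !== '' ? $m['prefix'][0] : null,
                'name'   => $m['name'][0],
                'source' => $source,
                'offset' => $offset,
            ];
        }
        $pos = $offset + strlen($source);
    }
    $text($pos, strlen($input)); // trailing text, if any

    return $tokens;
}

// List the token types found in a small, malformed fragment.
foreach (scanTagsoup("<vin:layout name=\"Page\">\n<aas>{someText}</vin:layout>") as $token) {
    printf("%-8s %s\n", $token['type'], trim($token['source']));
}
```

Because every token keeps its byte offset and source text, the original string can be reassembled or partially replaced, which is the property the rest of this answer relies on.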
Usage that changes every tag (start or end) that does not have the namespace prefix could look like this ($string contains the data from your question):
$scanner = new TagsoupIterator($string);
$nsPrefix = 'vin';
foreach ($scanner as $node) {
    $isTag = $node instanceof TagsoupTag;
    $isOfNs = $isTag && $node->getTagNsPrefix() === $nsPrefix;
    if ($isTag && !$isOfNs) {
        // Escape tags outside the namespace so they read as plain text in XML.
        $node = strtr($node, ['&' => '&amp;', '<' => '&lt;']);
    }
    echo $node;
}
Output:
<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
&lt;header>
{someText}
&lt;div>
<!-- some invalid xml code -->
&lt;aas>
&lt;nav class="main">
<vin:show section="Menu" />
&lt;/nav>
&lt;/div>
&lt;/header>
</vin:layout>
Usage that extracts everything inside a certain tag of a namespace could look like this:
$scanner = new TagsoupIterator($string);
$parser = new TagsoupForwardNavigator($scanner);
// Builds a condition that matches start tags with the given namespace prefix.
$startTagWithNsPrefix = function ($namespace) {
    return function (TagsoupNode $node) use ($namespace) {
        /* @var $node TagsoupTag */
        return $node->getType() === Tagsoup::NODETYPE_STARTTAG
            && $node->getTagNsPrefix() === $namespace;
    };
};
// Move to the first vin: start tag, step past it, then collect up to its end tag.
$start = $parser->nextCondition($startTagWithNsPrefix('vin'));
$tag = $start->getTagName();
$parser->next();
echo $html = implode($parser->getUntilEndTag($tag));
Output:
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
The next part is to replace that part of the $string. As Tagsoup offers binary offsets and lengths, this is easy (I take a slightly dirty shortcut via SimpleXML):
$xml = substr($string, 0, $start->getEnd()) . substr($string, $parser->getOffset());
$doc = new SimpleXMLElement($xml);
$doc[0] = $html;
echo $doc->asXML();
Output:
<vin:layout xmlns:vin="http://www.example.com/vin" name="Page">
&lt;header&gt;
{someText}
&lt;div&gt;
&lt;!-- some invalid xml code --&gt;
&lt;aas&gt;
&lt;nav class="main"&gt;
&lt;vin:show section="Menu" /&gt;
&lt;/nav&gt;
&lt;/div&gt;
&lt;/header&gt;
</vin:layout>
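The SimpleXML shortcut works because assigning a string to an element stores it as a text node, so any markup characters in it are escaped when the document is serialized. Here is a small stand-alone sketch of just that step, using a shortened, made-up fragment instead of the full data:

```php
<?php
// Stand-alone illustration of the SimpleXML step only; the fragment is made up.
$doc = new SimpleXMLElement(
    '<vin:layout xmlns:vin="http://www.example.com/vin" name="Page"/>'
);

// Index 0 refers to the element itself: its content is replaced with text.
$doc[0] = "<header>\n<aas>\n</header>";

// The malformed markup is now ordinary text, so the result is well-formed XML.
$xml = $doc->asXML();
echo $xml;
```

Reading the value back with (string) $doc returns the original, unescaped markup, so nothing is lost by storing it this way.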
Depending on your concrete needs, the implementation would have to change. For example, this one does not support nesting the same tag inside itself. It will not throw an error, but it does not handle that case either. I have no idea whether you have that case; if so, you would need to add an open/close counter. The navigator class could easily be extended for that, even to offer two kinds of end-tag finding methods.
The examples given here use the Tagsoup classes, which you can see in this gist: https://gist.github.com/4415105
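The open/close counter mentioned above could be sketched roughly as follows. This is not part of the gist: the function name collectUntilMatchingEndTag and the plain token arrays (with type, name and source keys, where name is the full tag name including any prefix) are assumptions for illustration only.

```php
<?php
// Hypothetical helper, not from the gist. Each token is assumed to be an array
// with 'type' (starttag, endtag, text, comment), 'name' and 'source'.
function collectUntilMatchingEndTag(array $tokens, int $startIndex): array
{
    $name  = $tokens[$startIndex]['name'];
    $depth = 1; // the start tag at $startIndex is already open
    $inner = [];

    for ($i = $startIndex + 1, $n = count($tokens); $i < $n; $i++) {
        $token = $tokens[$i];
        if ($token['name'] === $name) {
            $selfClosing = substr($token['source'], -2) === '/>';
            if ($token['type'] === 'starttag' && !$selfClosing) {
                $depth++; // same tag opened again inside
            } elseif ($token['type'] === 'endtag') {
                $depth--;
                if ($depth === 0) {
                    return $inner; // this end tag closes the starting tag
                }
            }
        }
        $inner[] = $token['source'];
    }

    return $inner; // unbalanced input: everything up to the end
}

// <div>a<div>b</div>c</div>d, written out as tokens by hand.
$tokens = [
    ['type' => 'starttag', 'name' => 'div', 'source' => '<div>'],
    ['type' => 'text',     'name' => null,  'source' => 'a'],
    ['type' => 'starttag', 'name' => 'div', 'source' => '<div>'],
    ['type' => 'text',     'name' => null,  'source' => 'b'],
    ['type' => 'endtag',   'name' => 'div', 'source' => '</div>'],
    ['type' => 'text',     'name' => null,  'source' => 'c'],
    ['type' => 'endtag',   'name' => 'div', 'source' => '</div>'],
    ['type' => 'text',     'name' => null,  'source' => 'd'],
];
echo implode('', collectUntilMatchingEndTag($tokens, 0)), "\n"; // a<div>b</div>c
```

A navigator could offer both variants: one that stops at the first end tag with the right name, as in the examples above, and one that counts depth like this.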