
Rationale behind SimpleXMLElement's handling of text values in addChild and addAttribute

Tags:

php

xml

Isn't this inconsistent behavior? (PHP 5.2.6)

<?php

$a = new SimpleXMLElement('<a/>');

$a->addAttribute('b', 'One & Two');
//$a->addChild('c', 'Three & Four'); -- results in "unterminated entity reference" warning!
$a->addChild('c', 'Three &amp; Four');
$a->d = 'Five & Six';

print($a->asXML());

Renders:

<?xml version="1.0"?>
<a b="One &amp; Two">
    <c>Three &amp; Four</c>
    <d>Five &amp; Six</d>
</a>

At bugs.php.net they reject all the submissions about that, saying it's a feature. Why could that possibly be? BTW, there's nothing in the docs about that discrepancy of escaping text values by SimpleXMLElement.

Can anyone convince me it's the best API design decision possible?

Ivan Krechetov asked Feb 16 '09 11:02


5 Answers

Just to make sure we're on the same page, you have three situations.

  1. The insertion of an ampersand into an attribute using addAttribute

  2. The insertion of an ampersand into an element using addChild

  3. The insertion of an ampersand into an element by property overloading

It's the discrepancy between 2 and 3 that has you flummoxed. Why does addChild not automatically escape the ampersand, whereas adding a property to the object and setting its value does escape the ampersand automatically?

Based on my instincts, and buoyed by this bug, this was a deliberate design decision. The property overloading ($a->d = 'Five & Six';) is intended to be the "escape ampersands for me" way of doing things. The addChild method is meant to be "add exactly what I tell you to add" method. So, whichever behavior you need, SimpleXML can accommodate you.

Let's say you had a database of text where all the ampersands were already escaped. The auto-escaping wouldn't work for you here. That's where you'd use addChild. Or let's say you needed to insert an entity in your document:

$a = simplexml_load_string('<root></root>');
$a->b = 'This is a non-breaking space &nbsp;';
$a->addChild('c','This is a non-breaking space &nbsp;');    
print $a->asXML();

That's what the PHP developer in that bug is advocating. The behavior of addChild is meant to provide "less simple, more robust" support for when you need to insert an ampersand into the document without it being escaped.
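
If you want the addChild call style but with the escaping done for you, one option is a small wrapper that escapes the text before handing it over. This is only a sketch: addChildEscaped is a made-up helper name, not part of SimpleXML, and it assumes PHP 7 or later (for the type declarations) and UTF-8 input.

```php
<?php
// Hypothetical helper, not part of SimpleXML: escape the text first,
// then pass the already-escaped string to addChild().
function addChildEscaped(SimpleXMLElement $parent, string $name, string $text): ?SimpleXMLElement
{
    // ENT_XML1 restricts the output to XML's predefined entities.
    // ENT_NOQUOTES leaves quotes alone; they need no escaping in element content.
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');

    return $parent->addChild($name, $escaped);
}

$a = new SimpleXMLElement('<a/>');
addChildEscaped($a, 'c', 'Three & Four');

print $a->asXML();
```

Reading the value back with (string) $a->c should give the original, unescaped text, so the round trip stays symmetric.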

Of course, this does leave us with the first situation I mentioned, the addAttribute method. The addAttribute method does escape ampersands. So, we might now state the inconsistency as

  1. The addAttribute method escapes ampersands
  2. The addChild method does not escape ampersands
  3. This behavior is somewhat inconsistent. It's reasonable that a user would expect the methods on SimpleXML to escape things in a consistent way

This then exposes the real problem with the SimpleXML api. The ideal situation here would be

  1. Property Overloading on Element Objects escapes ampersands
  2. Property Overloading on Attribute Objects escapes ampersands
  3. The addChild method does not escape ampersands
  4. The addAttribute method does not escape ampersands

This is impossible though, because SimpleXML has no concept of an Attribute Object. The addAttribute method is (appears to be?) the only way to add an attribute. Because of that, it turns out (seems?) SimpleXML is incapable of creating attributes with entities.

All of this reveals the paradox of SimpleXML. The idea behind this API was to provide a simple way of interacting with something that turns out to be complex.

The team could have added a SimpleXMLAttribute Object, but that's an added layer of complexity. If you want a multiple object hierarchy, use DOMDocument.

The team could have added flags to the addAttribute and addChild methods, but flags make the API more complex.

The real lesson here? Maybe it's that simple is hard, and simple on a deadline is even harder. I don't know if this was the case or not, but with SimpleXML it seems like someone started with a simple idea (use property overloading to make the creation of XML documents easy), and then adjusted as the problems/feature requests came in.

Actually, I think the real lesson here is to just use JSON ;)

Alan Storm answered Nov 17 '22 23:11


This is my solution. In particular, it handles adding several children with the same tag name:

$job->addChild('industrycode')->{0} = $entry1;
$job->addChild('industrycode')->{0} = $entry2;
$job->addChild('industrycode')->{0} = $entry3;
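
For anyone who wants to try this standalone, here is a self-contained variant of the same idea. It is a sketch, not the code above verbatim: it keeps the new child in a variable and writes its text through index 0 rather than ->{0}, and the sample values are invented for illustration.

```php
<?php
$job = new SimpleXMLElement('<job/>');

// Sample values (made up); the first one contains a bare ampersand on purpose.
$entries = ['Retail & Wholesale', 'IT'];

foreach ($entries as $entry) {
    // Create an empty child, then assign its text separately so the
    // value goes through SimpleXML's escaping write path.
    $code = $job->addChild('industrycode');
    $code[0] = $entry;
}

print $job->asXML();
```

Each pass through the loop adds another industrycode element, so repeated tag names are not a problem.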

Mathias Weitz answered Nov 18 '22 00:11


"Let's say you had a database of text where all the ampersands were already escaped."

If you're doing this, you're doing it wrong. Data should be stored in its most accurate form, not munged for whatever type of output you're currently using. This is even worse if you actually store blobs of (valid) HTML in the database. Using addChild() and grabbing the data out again will destroy your HTML; no sensible library exhibits such horrible asymmetry.

addChild() not encoding your text for you is completely counter-intuitive. What is the point in an API that doesn't protect you from this? It's like json_encode() barfing if you use a double quote in one of your values.

Anyway, to answer the original question: Obviously, I too think it's not a good decision. I do think it's consistent with a lot of PHP's design decisions, which is to fulfill someone's idea of what is "quicker", rather than being correct.

Daniel answered Nov 17 '22 22:11


The requirement for escaping the characters & and < is provided in the section Character Data and Markup and not in the section Attribute-Value Normalization, as the previous answer states.

To quote the XML Spec.:

"The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings &amp; and &lt; respectively"

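The rule quoted from the spec is easy to see from the parsing side. As a small sketch: a literal ampersand in element content makes the document ill-formed, so simplexml_load_string() refuses it, while the escaped form parses and reads back as a plain ampersand.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// A bare "&" in character data is not well-formed XML.
$bad = simplexml_load_string('<a>One & Two</a>');

// The escaped form is well-formed.
$good = simplexml_load_string('<a>One &amp; Two</a>');

var_dump($bad === false);   // whether the ill-formed input was rejected
print (string) $good . "\n"; // text content as seen by the application

libxml_clear_errors();
```
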
Dimitre Novatchev answered Nov 18 '22 00:11


Alan Storm gave a nice description of the issue; however, there's an easy solution to the paradox he describes. The addChild() method could have an optional boolean parameter that determines whether to automatically escape characters. So, I'm still convinced that it's simply a (very) poor design choice.

The confusion is compounded by the fact that the documentation for the addChild() method makes no reference whatsoever to the issue (although it is mentioned in the discussion). Furthermore, the method escapes some characters (namely the less-than and greater-than signs). This will mislead developers using the method into believing that it escapes characters in general.
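
A quick way to see the partial escaping for yourself is to pass addChild() a value containing a less-than sign. This is a minimal sketch with a made-up sample value; the behavior shown is what I'd expect from current PHP builds, so check it against your own version.

```php
<?php
$a = new SimpleXMLElement('<a/>');

// No ampersand here, so no "unterminated entity reference" warning;
// the less-than sign is escaped when the document is serialized.
$a->addChild('c', '1 < 2');

print $a->asXML();
```

So a developer who only ever tests with angle brackets could reasonably conclude that addChild() escapes everything.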

Graham Lexie answered Nov 17 '22 23:11