We get a lot of XML data from various sources. The encoding is UTF-8.
We notice that some documents have what appears to be double encoding of the ampersand: &amp;amp; instead of &amp;.
For example, a tag whose content should be A & B comes in as A &amp;amp; B. (Corrected; the original posting said &&.)
This causes some grief, as most of the XML components do not like it.
Is it valid? What is the best way of removing these? We use VB.Net 2008.
&amp;amp; is "valid", though whether you want to use it is another question.
If you're writing a document in XML, then &amp; will be used to represent an ampersand. If your XML document is describing content that itself is encoded in a similar way (e.g. HTML), then that content could logically include an &amp; itself. This could lead to an &amp;amp; in the XML.
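As a rough illustration of how the second level of encoding arises, here is a minimal Python sketch (Python rather than VB.Net, purely for illustration). It assumes the producer first HTML-escapes the text and then XML-escapes the result; the variable names are made up for this example.

```python
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape

plain = "A & B"

# Plain text placed straight into XML: one level of escaping.
single = xml_escape(plain)                   # "A &amp; B"

# Text treated as HTML first, then embedded in XML: two levels.
as_html = html_escape(plain, quote=False)    # "A &amp; B"
double = xml_escape(as_html)                 # "A &amp;amp; B"

print(single)
print(double)
```

Both results are well-formed XML content; they simply describe different payloads (plain text versus HTML source).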
For example, let's say you have XML describing a set of users, including a "signature" field that supports HTML:
<users>
  <user username="jsmith" ...>
    ...
    <signature type="text/html">
      John Smith's Heating And Plumbing
    </signature>
  </user>
</users>
If John Smith wanted to use a & instead of And in his signature, it would be...
<signature type="text/html">
John Smith's Heating &amp; Plumbing
</signature>
...where the & is encoded as &amp; to keep the XML parser happy.
Think of the situation in which the signature is being included in an HTML email. The XML parser will decode &amp; into &. If the signature is being dumped directly into the email, this will result in a raw "&" character appearing unescaped in the message's source.
However, if the XML had included &amp;amp;, upon XML parsing it would become &amp;. Then it would be included in the email as properly-escaped HTML.
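The one-level decoding described above can be checked with any conforming XML parser. The following is a small sketch using Python's standard xml.etree.ElementTree; the two input strings are made-up stand-ins for the signature element.

```python
import xml.etree.ElementTree as ET

single_xml = "<signature type='text/html'>John Smith's Heating &amp; Plumbing</signature>"
double_xml = "<signature type='text/html'>John Smith's Heating &amp;amp; Plumbing</signature>"

# The parser removes exactly one level of escaping.
single_text = ET.fromstring(single_xml).text
double_text = ET.fromstring(double_xml).text

print(single_text)  # John Smith's Heating & Plumbing      (raw & is not valid HTML source)
print(double_text)  # John Smith's Heating &amp; Plumbing  (safe to paste into HTML)
```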
A more readable way of accomplishing the same escaping might be this...
<signature type="text/html">
<![CDATA[John Smith's Heating &amp; Plumbing]]>
</signature>
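To see that the CDATA form and the doubly escaped form carry the same payload, here is a hedged sketch with Python's xml.etree.ElementTree. It assumes the CDATA section holds the HTML-escaped text (that is, &amp; rather than a bare &); the strings are illustrative only.

```python
import xml.etree.ElementTree as ET

escaped_xml = "<signature type='text/html'>John Smith's Heating &amp;amp; Plumbing</signature>"
cdata_xml = "<signature type='text/html'><![CDATA[John Smith's Heating &amp; Plumbing]]></signature>"

# Inside CDATA the parser performs no entity decoding, so the text is taken literally.
escaped_text = ET.fromstring(escaped_xml).text
cdata_text = ET.fromstring(cdata_xml).text

print(escaped_text == cdata_text)  # True
print(cdata_text)                  # John Smith's Heating &amp; Plumbing
```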
The above presumes that the signature is to include HTML-encoded entities, which are further encoded into the XML document. This is the source of the apparent double-encoding. If, for example, the signature should only include plain text, then there would only be a single encoding: & from the plain text into &amp; for the XML document. Thus both &amp;amp; and &amp; are "valid" from an XML perspective, and in practice it will depend on the specification for the data to be encoded into the XML document.
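On the consuming side, one possible approach is to let the XML parser undo its own level of escaping and then decode a second time only where the data is specified to be HTML. The sketch below is in Python rather than VB.Net and rests on assumptions: the function name signature_text is hypothetical, and the type="text/plain" value is invented for contrast with the text/html attribute used in the example.

```python
import html
import xml.etree.ElementTree as ET

def signature_text(user_xml: str) -> str:
    """Return the signature as plain text (hypothetical helper)."""
    sig = ET.fromstring(user_xml).find("signature")
    if sig is None:
        raise ValueError("no <signature> element found")
    text = (sig.text or "").strip()       # XML level already decoded by the parser
    if sig.get("type") == "text/html":    # assumption: the type attribute tells us the payload
        text = html.unescape(text)        # decode HTML character references only
    return text

html_user = "<user><signature type='text/html'>Heating &amp;amp; Plumbing</signature></user>"
plain_user = "<user><signature type='text/plain'>Heating &amp; Plumbing</signature></user>"

print(signature_text(html_user))   # Heating & Plumbing
print(signature_text(plain_user))  # Heating & Plumbing
```

Deciding per field, instead of globally replacing &amp;amp; in the raw document, avoids corrupting fields where the doubled entity is intentional. Note that html.unescape only decodes character references; it does not strip HTML tags.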
(A third option when writing the XML schema would be to use XML namespacing to permit contained HTML to be included without double-encoding; this would have the added benefit of permitting it to be validated, but in practice applying strict XML-style validation to HTML content is a headache. See e.g. the failed attempt to promote and standardize on XHTML.)