
Special characters in XML files - processing with the DOM API

I have a file in XML format (it consists only of the root start and end tags and the children of the root). The text of the child elements contains the ampersand symbol &. A bare & is not allowed in XML if the document is to be well-formed, and when I tried to process the file using the DOM API in Java and an XML parser, I got parsing errors. Therefore, I replaced & with &amp;, and then I processed the file successfully: I had to extract the values of the text elements into separate plain text files.

When I opened these newly created text files, I expected to see &amp;, but there was & instead. Why is this? I stored the text in files without any extension (my original XML file did not have an .xml extension either), and I see just & in the text of the new file no matter how I open it: as a txt or as an xml file (these are some of the options in my XML editor). What exactly happens? Does Java (?) convert &amp; to & automatically? Or is there some default encoding? Well, &amp; stands for &, and I suppose there is some "invisible" automatic conversion, but I am confused about when and how this happens. Here are examples of my original file and of the extracted file that I get after processing the original file with Java:

This is my "negative.review" file in XML format:

<review>
<review_text>
I will not wear it as it is too big &amp; looks funny on me. 
</review_text>
</review>

This is my extracted file "negative_1":

I will not wear it as it is too big & looks funny on me. 

For me it is important to have the original data as it is (without any conversions/replacements), so I thought I would have to post-process the extracted file "negative_1", converting &amp; back to &. As you see, it seems I don't have to do this. But I don't understand why :(.

Thank you in advance!

user42155 Avatar asked May 16 '09 08:05





2 Answers

The reason is simple: The XML file really contains an "&" character.

It is just represented differently (i.e. it is "escaped"), because a real "&" on its own breaks XML files, as you've seen. Read the relevant section in the XML 1.0 spec: "2.4 Character Data and Markup". It's just a few lines, but it explains the issue quite well.

XML is a representation of data (!). Don't think of it as a text file. Example:

You want to store the string "17 < 20" in an XML file. Written literally, you can't, since the "<" is reserved as the opening tag bracket. So this would not be well-formed:

<xml>17 < 20</xml>

Solution: You escape the special/reserved character, purely to keep the file well-formed:

<xml>17 &lt; 20</xml>

For all practical purposes the above snippet contains the following data (in JSON representation this time):

{
  "xml": "17 < 20"
}

This is why you see the real "&" in your post-processing. It had been escaped in just the same way, but its meaning stayed the same all the time.

The above example also explains why the "&" must be treated specially: It is itself part of the XML escaping mechanism. It marks the start of an escape sequence, like in "&lt;". Therefore it must be escaped itself (with "&amp;", like you've done).
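To make this concrete, here is a minimal Java sketch using the standard JAXP DOM API. The class name `EntityDemo` and the helper `extractReviewText` are made up for illustration and are not taken from the question's code; the XML string is a shortened version of the "negative.review" sample.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EntityDemo {

    // Hypothetical helper: parses the XML string and returns the text
    // content of the first <review_text> element.
    static String extractReviewText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // The parser has already resolved &amp; to a plain '&' here.
            return doc.getElementsByTagName("review_text").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<review><review_text>too big &amp; looks funny</review_text></review>";
        System.out.println(extractReviewText(xml)); // prints: too big & looks funny
    }
}
```

The string returned by `getTextContent()` is the data itself, so writing it to a plain text file gives "&" with no extra conversion step needed.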

Tomalak Avatar answered Nov 15 '22 10:11



Any XML parser will implicitly translate entity references such as &amp;, &lt;, and &gt; into the corresponding characters, as part of the process of parsing the file.
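The reverse direction works the same way: a serializer escapes the characters again when the data is written back out as XML. The following is a rough sketch using the standard JAXP `Transformer`; the class name `RoundTripDemo` and the helper `serializeText` are assumptions made for this example only.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RoundTripDemo {

    // Hypothetical helper: wraps plain text in a <review_text> element
    // and returns the serialized XML as a string.
    static String serializeText(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("review_text");
            root.setTextContent(text); // stored as-is, with a plain '&'
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // Serialization is where the '&' is escaped back to &amp;.
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeText("too big & looks funny"));
    }
}
```

So the escaping only exists in the serialized XML text; inside the DOM, and in anything extracted from it, the character is always a plain "&".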

Alex Martelli Avatar answered Nov 15 '22 10:11
