Special characters in XML files - processing with the DOM API

Tags: dom, xml, special-characters

I have a file in XML format (it consists only of the root start and end tags and the children of the root). The text of the child elements contains the ampersand symbol &. XML does not allow a bare & in a well-formed document, and when I tried to process the file using the DOM API in Java and an XML parser, I got parsing errors. Therefore, I replaced & with &amp;, and the file was processed successfully: I had to extract the values of the text elements into separate plain text files.

When I opened these newly created text files, I expected to see &amp;, but there was & instead. Why is this? I stored the text in files without any extension (my original XML-format file did not have an .xml extension either), and I see just & in the text of the new file no matter how I open it: as a txt or as an xml file (these are some of the options in my XML editor). What exactly happens? Does Java (?) convert &amp; to & automatically? Or is there some default encoding? Well, &amp; stands for &, and I suppose there is some "invisible" automatic conversion, but I am confused about when and how this happens. Here are examples of my original file and the extracted file that I get after processing the original file with Java:

This is my "negative.review" file in XML format:

<review>
<review_text>
I will not wear it as it is too big &amp; looks funny on me. 
</review_text>
</review>

This is my extracted file "negative_1":

I will not wear it as it is too big & looks funny on me.

For me it is important to keep the original data as it is (without any conversions/replacements), so I thought I would have to process the extracted file "negative_1" and convert &amp; back to &. As you can see, it seems I don't have to do this. But I don't understand why :(.

Thank you in advance!

asked May 16 '09 08:05

user42155

2 Answers

The reason is simple: The XML file really contains an "&" character.

It is just represented differently (i.e. it is "escaped"), because a real "&" on its own breaks XML files, as you've seen. Read the relevant section in the XML 1.0 spec: "2.4 Character Data and Markup". It's just a few lines, but it explains the issue quite well.
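To see the "breaks XML files" part for yourself, here is a minimal sketch using the JDK's built-in DOM parser. The class name WellFormedCheckSketch and the helper isWellFormed are made up for this illustration; only the javax.xml and org.xml.sax calls come from the standard library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original question or answers.
class WellFormedCheckSketch {

    // Returns true if the JDK's DOM parser accepts the string as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. because of a bare "&" or "<".
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not what this sketch is about.
            throw new IllegalStateException("Unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<xml>too big & looks funny</xml>"));     // bare ampersand
        System.out.println(isWellFormed("<xml>too big &amp; looks funny</xml>")); // escaped ampersand
    }
}
```

The first call should report false and the second true, which matches the parsing errors described in the question.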

XML is a representation of data (!). Don't think of it as a text file. Example:

You want to store the string "17 < 20" in an XML file. Initially, you can't, since the "<" is reserved as the opening tag bracket. So this would be invalid:

<xml>17 < 20</xml>

Solution: You escape the special/reserved character, purely to keep the file well-formed:

<xml>17 &lt; 20</xml>

For all practical purposes the above snippet contains the following data (in JSON representation this time):

{
  "xml": "17 < 20"
}

This is why you see the real "&" in your post-processing. It had been escaped in just the same way, but its meaning stayed the same all the time.

The above example also explains why the "&" must be treated specially: It is itself part of the XML escaping mechanism. It marks the start of an escape sequence, like in "&lt;". Therefore it must be escaped itself (with "&amp;", like you've done).
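Applied to the file from the question, this can be sketched in Java roughly as follows. This is only an illustration under assumptions: the class name ReviewTextSketch and the helper extractReviewText are invented here, and the XML is inlined as a string instead of being read from "negative.review".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not the asker's actual code.
class ReviewTextSketch {

    // Parses the XML with the JDK's DOM parser and returns the text of the
    // first <review_text> element, with surrounding whitespace trimmed.
    static String extractReviewText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("review_text")
                      .item(0)
                      .getTextContent()
                      .trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<review><review_text>"
                + "I will not wear it as it is too big &amp; looks funny on me."
                + "</review_text></review>";
        // The DOM text node already holds the unescaped character data.
        System.out.println(extractReviewText(xml));
        // prints: I will not wear it as it is too big & looks funny on me.
    }
}
```

By the time getTextContent() returns, the escape sequence is gone. The string your program holds contains a plain "&", so writing it to a plain text file gives exactly the original data.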

answered Nov 15 '22 10:11

Tomalak

Any XML parser will implicitly translate entities such as &amp;, &lt;, &gt; into the corresponding characters, as part of the process of parsing the file.
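A small round-trip sketch with the JDK's own parser and serializer shows both directions. The class EntityRoundTripSketch and its helpers decodedText and reserialize are hypothetical names chosen for this example, and the exact serialized form may differ between XML implementations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class EntityRoundTripSketch {

    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // What the program sees after parsing: entities are already translated.
    static String decodedText(String xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XML", e);
        }
    }

    // What an XML serializer writes back: reserved characters are escaped again.
    static String reserialize(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(parse(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<xml>17 &lt; 20 &amp; 3 &gt; 2</xml>";
        System.out.println(decodedText(xml)); // character data with real <, & and >
        System.out.println(reserialize(xml)); // XML text with the entities restored
    }
}
```

The translation only happens at the XML boundary. If you write the extracted string with an ordinary Writer, as in the question, nothing escapes it again, which is why the text files contain a bare "&".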

answered Nov 15 '22 10:11

Alex Martelli
