What use is the 'encoding' in the XML header?

Tags: character-encoding, xml, header

Looking at the XML header

<?xml version="1.0" encoding="UTF-16" standalone="no"?>

Am I right to state that the encoding attribute is

- coming too late (you can't read it properly unless you know the encoding...)
- redundant, hence error-prone: it's all too easy to replace it with "Big5" yet save the file in UTF-8

Or is that attribute not about the content of the stream?

Am I mixing up things here?


asked Mar 02 '11 09:03

xtofl

1 Answer

As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.

However, there is a heuristic that gets you close enough to the "real" encoding to read the encoding attribute. This works because the XML declaration (the <?xml ... ?> part) by definition can only contain characters in the ASCII range (however they are encoded), so its first few bytes are enough to reveal the encoding family.

The XML standard even describes the exact process used to find out the encoding (Appendix F, "Autodetection of Character Encodings", which is non-normative).
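To make the heuristic concrete, here is a minimal Python sketch of that kind of byte sniffing. It follows the general idea of the detection table in the XML spec but is simplified; the function name `sniff_encoding_family` and the returned family labels are my own choices for illustration, not anything the spec defines.

```python
def sniff_encoding_family(data: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML document.

    Hypothetical helper: a simplified take on the autodetection idea,
    not a complete implementation of the spec's table.
    """
    # Longer signatures come before their own prefixes
    # (the UTF-32 LE byte order mark starts with the UTF-16 LE one).
    signatures = [
        (b"\x00\x00\xfe\xff", "utf-32-be"),      # byte order mark
        (b"\xff\xfe\x00\x00", "utf-32-le"),      # byte order mark
        (b"\xfe\xff", "utf-16-be"),              # byte order mark
        (b"\xff\xfe", "utf-16-le"),              # byte order mark
        (b"\xef\xbb\xbf", "utf-8"),              # byte order mark
        (b"\x00\x00\x00<", "utf-32-be"),         # "<" as 4 bytes, no BOM
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00<\x00?", "utf-16-be"),            # "<?" as 2-byte units
        (b"<\x00?\x00", "utf-16-le"),
        (b"<?xm", "ascii-compatible"),           # label still needed
        (b"\x4c\x6f\xa7\x94", "ebcdic"),         # "<?xm" in EBCDIC
    ]
    for prefix, family in signatures:
        if data.startswith(prefix):
            return family
    # No declaration and no byte order mark: UTF-8 is the default.
    return "utf-8"


declaration = '<?xml version="1.0" encoding="UTF-16"?><a/>'
print(sniff_encoding_family(declaration.encode("utf-16-le")))
print(sniff_encoding_family(declaration.encode("ascii")))
```

Note that the "ascii-compatible" result is deliberately vague: the sniffing only tells you enough to read the declaration, not which encoding the rest of the document uses.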

And the encoding label isn't redundant either. For example, if you use the algorithm in the XML spec to find out that some ASCII-based (or ASCII-compatible) encoding is used, you still need to read the encoding label to find out which one is actually used (valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others). For the <?xml part itself it won't make a difference which one it is, but for the rest of the document it can make a huge difference.
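A small Python sketch shows why the label matters once you know the encoding is ASCII-compatible. The helper names (`declared_encoding`, `decode_document`, `make_doc`) and the regular expression are assumptions made for this illustration; a real parser does this more carefully. The body bytes are identical in all three documents, only the label changes.

```python
import re

# Matches the encoding label in an XML declaration read as raw bytes.
# This is only meaningful once the document is known to be ASCII-compatible.
DECL_ENCODING = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the label from the declaration, or the default if there is none."""
    match = DECL_ENCODING.match(data)
    return match.group(1).decode("ascii") if match else default


def decode_document(data: bytes) -> str:
    """Decode the whole document using the label found in its declaration."""
    return data.decode(declared_encoding(data))


def make_doc(label: str) -> bytes:
    # Same body bytes every time (a single 0xE9 byte inside <a>).
    prefix = b'<?xml version="1.0" encoding="' + label.encode("ascii") + b'"?>'
    return prefix + b"<a>\xe9</a>"


for label in ("ISO-8859-1", "windows-1251", "KOI8-R"):
    print(label, "->", decode_document(make_doc(label)))
```

The declaration decodes identically in all three cases, but the single byte 0xE9 in the body becomes a different character each time.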

Regarding mislabeled XML files: yes, it's easy to produce those. However, the XML spec clearly specifies that such files are malformed and as such are not correct XML. Incorrect encodings must be reported as an error (as long as they can be detected!). So it's the problem of whoever is producing the XML.
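As a rough illustration of both halves of that statement, here is what Python's standard `xml.etree.ElementTree` parser does with mislabeled input. The `parse_text` helper is hypothetical, and other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET


def parse_text(data: bytes) -> str:
    """Return the root element's text, or a note that the parser refused."""
    try:
        return ET.fromstring(data).text or ""
    except ET.ParseError as exc:
        return f"rejected: {exc}"


latin1_body = "<a>\u00e9</a>".encode("iso-8859-1")  # e-acute as one byte, 0xE9
utf8_body = "<a>\u00e9</a>".encode("utf-8")         # e-acute as two bytes, 0xC3 0xA9

# Label matches the bytes: parses fine.
honest = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
# Latin-1 bytes labeled UTF-8: 0xE9 is not valid UTF-8 here, so it is detectable.
mislabeled = b'<?xml version="1.0" encoding="UTF-8"?>' + latin1_body
# UTF-8 bytes labeled Latin-1: every byte is valid Latin-1, so nothing is detected.
undetected = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + utf8_body

for name, doc in (("honest", honest), ("mislabeled", mislabeled), ("undetected", undetected)):
    print(name, "->", repr(parse_text(doc)))
```

The second document is rejected with a parse error, while the third parses without complaint and silently yields mojibake, which is the "as long as they can be detected" caveat in practice.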

answered Oct 11 '22 15:10

Joachim Sauer
