
XML declaration encoding

Tags:

xml

What does it actually do? At my very basic level of understanding, XML is just formatted text, so there is no binary<->text transformation involved.

I strongly suspect that the only difference between UTF-8 and ASCII encoding is that ASCII makes the XML writer work harder: it has to convert every non-ASCII character into an XML character reference, rather than escaping only the reserved XML characters. So ASCII-encoded XML can still carry non-ASCII characters, except the result is going to be slightly longer and uglier.

Or is there some other function to it?

Update:

I understand perfectly well how individual characters are converted into byte(s) by means of an encoding. However, XML is just text markup, and at no point does it do that.

The real question is: why is the encoding value stored in the XML itself? Or, in what case would an XML reader need to know which encoding was used for a particular XML document?

Ilia G asked Dec 10 '22 05:12



2 Answers

See Appendix F in the XML specification, "Autodetection of Character Encodings".

In particular, "XML encoding value is stored in the XML" because, in the absence of both an encoding declaration and external metadata (for example, an HTTP Content-Type header), an XML processor must assume the document is in UTF-8 or UTF-16. The XML declaration is designed for the cases where no such external metadata is present.
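To make that concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` (my choice for illustration, not something the question mentions). The element name and sample text are made up. The same ISO-8859-1 bytes parse correctly when the declaration names the encoding, and are rejected when the parser has to fall back to the UTF-8 default.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "café" encoded as ISO-8859-1, where é is the single byte 0xE9.
latin1_body = "<name>caf\u00e9</name>".encode("iso-8859-1")

# With a declaration, the parser knows how to turn the bytes back into characters.
with_decl = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
declared_text = ET.fromstring(with_decl).text

# Without one, the parser assumes UTF-8, and a lone 0xE9 is not valid UTF-8.
try:
    ET.fromstring(latin1_body)
    undeclared_error = None
except ET.ParseError as exc:
    undeclared_error = exc

print(declared_text)
print(type(undeclared_error).__name__)
```

So the reader does need the encoding: the document arrives as bytes, and the declaration is what tells the parser which characters those bytes stand for.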

Another advantage of how XML handles encodings is that an XML processor is required to support only two of them, namely UTF-8 and UTF-16. If the processor discovers, either in external metadata or in the XML declaration, that the document is in an encoding it does not support, it can fail right away. Otherwise it would have to guess the encoding with implementation-dependent heuristics, keep reading, and fail only much later (long after the declaration), when it runs into a byte sequence that is invalid for the guessed encoding.
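A rough sketch of the kind of autodetection Appendix F describes might look like the following. This is a simplified illustration, not a conforming implementation: the function name `sniff_xml_encoding` and the regular expression are my own, EBCDIC and the unusual UCS-4 byte orders are left out, and a real parser does considerably more validation.

```python
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration
# (only usable when the document starts in an ASCII-compatible encoding).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # Byte order marks; the 4-byte marks must be tested before the 2-byte ones.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # No BOM: the characters "<?" look different in each UTF-16 byte order.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # ASCII-compatible start: the declaration itself can be read safely.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: the spec's default is UTF-8.
    return "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding('<?xml version="1.0"?><a/>'.encode("utf-16-le")))
print(sniff_xml_encoding(b"<a/>"))
```

The point of the design is that the declaration is written using only ASCII-range characters, so a processor can read it reliably before it knows the exact encoding of the rest of the document.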

Peter O. answered Jan 11 '23 19:01



I'd highly, HIGHLY recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). You're saying XML is "just text" as if that makes everything simple, but even knowing that it's text as opposed to some structured binary format doesn't mean you know exactly how to read it or what characters are therein.

This isn't a "go read the manual!" answer; I believe that establishing a baseline for how difficult text can be will help explain why the XML declaration exists.

why does XML declaration need encoding in the first place?

This is one of the ideas addressed in the article, but it's worth stressing here: All text has an encoding. There is no such thing as 'Plain Text'. ASCII is an encoding, even if we don't think about it most of the time. Historically we've often stuck our heads in the sand and assumed everything is ASCII, but that isn't feasible anymore. The XML declaration's encoding helps us out, whereas a .txt file has nothing to indicate what its encoding is.
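A small illustration of "all text has an encoding", as a sketch in Python with a made-up sample string: the same five bytes turn into different text depending on which encoding the reader assumes, and under ASCII they cannot be read at all.

```python
# Hypothetical sample: the four characters "café" written out as UTF-8 bytes.
raw = "caf\u00e9".encode("utf-8")      # b'caf\xc3\xa9', five bytes

as_utf8 = raw.decode("utf-8")          # the text the writer intended
as_latin1 = raw.decode("iso-8859-1")   # same bytes, wrong guess: garbled text

# Strict ASCII has no meaning for bytes above 0x7F, so decoding fails outright.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(repr(as_utf8), repr(as_latin1), ascii_failed)
```

Nothing in the bytes themselves says which reading is right. A .txt file leaves the reader guessing; an XML document can state the answer in its declaration.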

Jason Viers answered Jan 11 '23 18:01
