What is encoding in XML? The encoding normally used is UTF-8. How is it different from other encodings, and what is the purpose of using it?
XML encoding is the process of converting the characters of a document into a byte (binary) representation. When an XML processor reads a document, it decodes the byte stream according to the declared encoding; the character encoding is specified through the 'encoding' attribute of the XML declaration.
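As a minimal sketch of that round trip (in Python, using the standard xml.etree.ElementTree module purely for illustration; any conforming XML parser behaves the same way):

```python
import xml.etree.ElementTree as ET

# The encoding is named in the XML declaration's "encoding" pseudo-attribute.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<greeting>café</greeting>'

# Serialize the text to bytes using the declared encoding ...
data = doc.encode("iso-8859-1")

# ... and let the parser decode those bytes back into characters,
# guided by the declaration it finds at the start of the byte stream.
root = ET.fromstring(data)
print(root.text)  # -> café
```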
XML documents must be encoded in a code page the processor supports. Some platforms impose stricter rules: on IBM systems, for instance, XML documents generated in or parsed from national data items must be encoded in Unicode UTF-16 in big-endian format (CCSID 1200).
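For illustration only, here is what the UTF-16 big-endian byte layout looks like (CCSID 1200 is IBM's label for this encoding):

```python
# Every UTF-16 code unit is two bytes; big-endian puts the high byte first.
text = "<a>hi</a>"
print(text.encode("utf-16-be").hex(" "))
# -> 00 3c 00 61 00 3e 00 68 00 69 00 3c 00 2f 00 61 00 3e
```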
UTF-8 (Unicode Transformation Format, 8-bit encoding form) is designed for ease of use with existing ASCII-based systems and can represent every character in the Unicode standard.
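A quick Python sketch of that ASCII compatibility:

```python
# A pure-ASCII string produces identical bytes in ASCII and in UTF-8.
s = "hello"
assert s.encode("ascii") == s.encode("utf-8")

# Characters outside ASCII use multi-byte sequences instead.
print("€".encode("utf-8"))  # b'\xe2\x82\xac'
```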
UTF-8 is the default character encoding for XML documents. It is also the default encoding for HTML5, CSS, JavaScript, PHP, and SQL.
A character encoding specifies how characters are mapped onto bytes. Since XML documents are stored and transferred as byte streams, an encoding is necessary to represent the Unicode characters that make up an XML document.
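To make the mapping concrete, the same character yields different bytes under different encodings (a Python sketch):

```python
# One character, three encodings, three different byte sequences.
ch = "é"
print(ch.encode("utf-8"))       # b'\xc3\xa9'
print(ch.encode("iso-8859-1"))  # b'\xe9'
print(ch.encode("utf-16-be"))   # b'\x00\xe9'
```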
UTF-8 is chosen as the default because it has several advantages: it is backward-compatible with ASCII (any pure-ASCII file is already valid UTF-8), it can represent every Unicode character, it has no byte-order ambiguity, and it stays compact for text that is mostly ASCII.
Character encodings are a more general topic than just XML; UTF-8 is not restricted to XML.
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is an article that gives a good overview of the topic.
When computers were first created, they mostly worked only with characters found in the English language, leading to the 7-bit US-ASCII standard.
However, there are many different written languages in the world, and ways had to be found to use them in computers.
The first way works fine if you restrict yourself to a single language: use a culture-specific encoding, such as ISO-8859-1, which can represent Latin/Western European characters in 8 bits, or GB2312 for Chinese characters. A small sketch of this trade-off follows.
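A Python sketch of that trade-off (the "gb2312" codec name and the UnicodeEncodeError exception are Python's):

```python
# A culture-specific encoding covers its own repertoire in few bytes ...
print("é".encode("iso-8859-1"))  # b'\xe9'      (one byte in this Latin code page)
print("中".encode("gb2312"))     # b'\xd6\xd0'  (two bytes in this Chinese code page)

# ... but cannot represent characters outside that repertoire.
try:
    "中".encode("iso-8859-1")
except UnicodeEncodeError as e:
    print("not representable in ISO-8859-1:", e.reason)
```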
The second way is a bit more complicated but can, in theory, represent every character in the world: it's the Unicode standard, in which every character from every language has a specific code. However, given the high number of existing characters (around 109,000 in Unicode 5), a naive fixed-width representation of a Unicode character needs three bytes (one byte for the Unicode plane, and two bytes for the character code within that plane).
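In Python terms, the code point is what ord() returns, independent of any byte encoding:

```python
# Every character has a Unicode code point, whatever encoding later stores it.
print(hex(ord("A")))   # 0x41    — the same value ASCII assigns it
print(hex(ord("中")))  # 0x4e2d  — Basic Multilingual Plane (plane 0)
print(hex(ord("😀")))  # 0x1f600 — a supplementary plane (plane 1)
```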
In order to maximize compatibility with existing code (some of which still works with ASCII text), the UTF-8 encoding was devised as a way to store Unicode characters using the minimal amount of space, as described in Joachim Sauer's answer.
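A short sketch of UTF-8's variable width, in Python:

```python
# UTF-8 is variable-width: ASCII stays at one byte, other characters grow as needed.
for ch in ("A", "é", "中", "😀"):
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
# A -> 1, é -> 2, 中 -> 3, 😀 -> 4
```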
So it's common to see files encoded with specific charsets such as ISO-8859-1 when a file is meant to be edited or read only by software (and people) that understand those languages, and with UTF-8 when there's a need to be highly interoperable and culture-independent. The current tendency is for UTF-8 to replace other charsets, even though it takes work from software developers, since UTF-8 strings are more complicated to handle than fixed-width charset strings.
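One illustration of that extra complexity, in Python:

```python
# With a variable-width encoding, byte offsets and character offsets diverge.
s = "café"
b = s.encode("utf-8")
print(len(s))  # 4 characters
print(len(b))  # 5 bytes — "é" occupies two of them
```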