UTF-8 (Unicode Transformation Format, 8-bit encoding form) is designed for ease of use with existing ASCII-based systems and enables use of all the characters in the Unicode standard.
UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.
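As a quick check of the "one to four bytes" claim, here is a minimal Python sketch. The four sample characters are chosen only for illustration; they are not taken from the text above.

```python
# Sample characters, one per UTF-8 length class (illustrative choices).
samples = [
    "A",           # U+0041, plain ASCII letter
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Map each character to the number of bytes its UTF-8 encoding occupies.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")

# ASCII compatibility: an ASCII character encodes to the same single byte.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
```

The last line illustrates the backward compatibility mentioned above: for characters in the ASCII range, the UTF-8 bytes and the ASCII bytes are identical.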
Encoding is the process of converting Unicode characters into their equivalent binary representation. When the XML processor reads an XML document, it decodes the document according to the encoding in use.
UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.
To understand the "encoding" attribute, you have to understand the difference between bytes and characters.
Think of bytes as numbers between 0 and 255, whereas characters are things like "a", "1" and "Ä". The set of all characters that are available is called a character set.
Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.
Most encodings are based on an old character set and encoding called ASCII, which uses a single byte per character (actually, only 7 bits) and contains 128 characters, including many of the common characters used in US English.
For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.
Extract of ASCII Table 60-65
╔══════╦═══════════╗
║ Byte ║ Character ║
╠══════╬═══════════╣
║ 60   ║ <         ║
║ 61   ║ =         ║
║ 62   ║ >         ║
║ 63   ║ ?         ║
║ 64   ║ @         ║
║ 65   ║ A         ║
╚══════╩═══════════╝
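The same extract can be reproduced in a few lines of Python. This is only a sketch to show the byte-to-character mapping; the variable name ascii_rows is made up for the example.

```python
# Decode the byte values 60..65 as ASCII, one byte at a time.
ascii_rows = [(value, bytes([value]).decode("ascii")) for value in range(60, 66)]

for value, char in ascii_rows:
    print(f"{value:>4}  {char}")
```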
In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are hidden control characters).
However, once you need more characters than basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and you need a different encoding, because 128 values are not enough to fit all the characters in. Some encodings use one byte per character (256 possible values), while others use several bytes per character.
Over time, many encodings have been created. In the Windows world, CP1252 and the closely related ISO-8859-1 are common, whereas Linux users tend to favour UTF-8. Java uses UTF-16 internally for its strings.
One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.
For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.
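The byte values quoted above can be checked directly. The following Python sketch encodes â both ways, then reads the UTF-8 bytes back with the wrong encoding; the variable names are only for illustration.

```python
char = "\u00e2"  # the character â (a with circumflex)

latin1_bytes = list(char.encode("iso-8859-1"))  # one byte in ISO 8859-1
utf8_bytes = list(char.encode("utf-8"))         # two bytes in UTF-8

# Reading the UTF-8 bytes as ISO 8859-1 yields two unrelated characters.
misread = bytes(utf8_bytes).decode("iso-8859-1")

# Going the other way, the lone ISO 8859-1 byte is not valid UTF-8 at all.
try:
    bytes(latin1_bytes).decode("utf-8")
    latin1_valid_as_utf8 = True
except UnicodeDecodeError:
    latin1_valid_as_utf8 = False

print(latin1_bytes, utf8_bytes, misread, latin1_valid_as_utf8)
```

This covers both failure modes: the same bytes standing for different characters, and bytes that are simply invalid in the other encoding.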
Think of XML as not a sequence of characters but a sequence of bytes.
Imagine the system receiving the XML sees the bytes 195, 162. How does it know what characters these are?
In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used in the XML.
Since most common encodings are compatible with ASCII as far as basic alphabetic characters and symbols go, in these cases the declaration itself can get away with using only ASCII characters to say what the encoding is. In other cases, the parser must try to figure out the encoding of the declaration. Since it knows the declaration begins with <?xml, this is a lot easier to do.
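To make the idea concrete, here is a rough Python sketch of how a parser might guess the encoding from the first bytes. The function sniff_xml_encoding is a hypothetical helper written for this example, not part of any library, and it is much simpler than what real parsers do: it ignores UTF-32, EBCDIC and other cases, and it assumes UTF-8 when nothing is declared.

```python
import re

def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding from its first bytes (illustrative only)."""
    # A byte order mark, if present, identifies the encoding family directly.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # No BOM: the known prefix "<?" reveals a 16-bit layout by its zero bytes.
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # ASCII-compatible encoding: the declaration can be read as plain ASCII.
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']', data
    )
    if match:
        return match.group(1).decode("ascii").lower()
    # Assumption for this sketch: fall back to UTF-8 when nothing is declared.
    return "utf-8"

declared = sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')
wide = sniff_xml_encoding('<?xml version="1.0"?><a/>'.encode("utf-16-le"))
print(declared, wide)
```

The point is only that the fixed, known start of the declaration gives the parser enough to work out how to read the rest of it.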
Finally, the version attribute specifies the XML version, of which there are two at the moment (see Wikipedia XML versions). There are slight differences between the versions, so an XML parser needs to know what it is dealing with. In most cases (for English speakers anyway), version 1.0 is sufficient.
An XML declaration is not required in all XML documents; however XHTML document authors are strongly encouraged to use XML declarations in all their documents. Such a declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding was determined by a higher-level protocol. Here is an example of an XHTML document. In this example, the XML declaration is included.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>.</p>
</body>
</html>
Please refer to the W3 standards for XML.
This is the optional XML preamble.
version="1.0" means that this is the version of the XML standard this file conforms to.
encoding="utf-8" means that the file is encoded using the UTF-8 Unicode encoding.
The encoding declaration identifies which encoding is used to represent the characters in the document.
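As an illustration of the encoding declaration at work, the sketch below uses Python's standard xml.etree.ElementTree module to parse the same one-element document stored in two different encodings. The element name p and the sample character are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

text = "\u00e2"  # the character â

# The same document serialized twice; each declaration matches its actual bytes.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>' + text + "</p>"
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="utf-8"?><p>' + text + "</p>"
).encode("utf-8")

# Given raw bytes, the parser reads the declaration and decodes accordingly.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_text == utf8_text)
```

The bytes on disk differ between the two documents, but because each declaration is truthful, the parser recovers the same character from both.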
More on the XML Declaration here: http://msdn.microsoft.com/en-us/library/ms256048.aspx