I thought that the name of the character set was "Unicode" and that "UTF-8" was the name of a particular encoding of the Unicode character set, but I often see the terms "encoding" and "charset" used interchangeably when referring to UTF-8.
For example,
<meta charset="UTF-8">
vs
<?xml version="1.0" encoding="UTF-8" ?>
UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.
Fortunately UTF-8 is the default per sé. When reading an XML document and writing it in another encoding, mostly this attribute will be patched too.
Characters in a character set are stored as one or more bytes in a computer. Each byte or sequence of bytes represents a given character. A character encoding is the key that maps a particular byte or sequence of bytes to particular characters that the font renders as text.
UTF-8 is the most widely used way to represent Unicode text in web pages, and you should always use UTF-8 when creating your web pages and databases. But, in principle, UTF-8 is only one of the possible ways of encoding Unicode characters.
Is UTF-8 an encoding or a character set?
UTF-8 is an encoding and that term is used in the RFC that defines it which is quoted below.
I often see the terms "encoding" and "charset" used interchangeably
Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use a encoding that only encoded to characters in that alphabet. Thus, the terms encoding and charset were often conflated but they mean different things.
Now though, Unicode is usually the only character set you need to worry about since it contains characters for most written languages you'll have to deal with, except Klingon.
† - Alphabet, a kind of *character set* where characters correspond directly to sounds in a spoken language.A character set is a mapping from code-units (integers) to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set that maps 21b integers to unicode codepoints. The Unicode Consortium's glossary describes it thus:
Unicode
- The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.
- A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.
An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps strings of bytes (8b integers) to strings of code-points (21b integers). The Unicode Consortium calls it a "character encoding scheme" and it is defined in RFC 3629.
The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With