 

Is UTF-8 an encoding or a character set?

I thought that the name of the character set was "Unicode" and that "UTF-8" was the name of a particular encoding of the Unicode character set, but I often see the terms "encoding" and "charset" used interchangeably when referring to UTF-8.

For example,

<meta charset="UTF-8">

vs

<?xml version="1.0" encoding="UTF-8" ?>
J Smith asked Mar 05 '13 15:03




1 Answer

Is UTF-8 an encoding or a character set?

UTF-8 is an encoding, and that is the term used in the RFC that defines it, which is quoted below.


I often see the terms "encoding" and "charset" used interchangeably

Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that covered only the characters in that alphabet. Each such encoding came bundled with its own character set, so the terms encoding and charset were often conflated, but they mean different things.
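To illustrate the pre-Unicode situation, here is a minimal Python sketch. It assumes ISO 8859-5 (Cyrillic) and ISO 8859-7 (Greek) as representative legacy encodings; the answer does not name specific ones. The same byte value decodes to a different character depending on which alphabet-specific encoding you pick.

```python
# One byte value, interpreted with two pre-Unicode single-byte encodings.
# ISO 8859-5 and ISO 8859-7 are illustrative choices, not taken from the answer.
raw = bytes([0xE1])

as_cyrillic = raw.decode("iso8859_5")  # Cyrillic repertoire
as_greek = raw.decode("iso8859_7")     # Greek repertoire

# The byte is identical; the character it stands for is not.
print(as_cyrillic, hex(ord(as_cyrillic)))
print(as_greek, hex(ord(as_greek)))
```

Because each of these encodings fixed both the repertoire of characters and the bytes used for them, there was little practical reason to tell "charset" and "encoding" apart.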

Now though, Unicode is usually the only character set you need to worry about since it contains characters for most written languages you'll have to deal with, except Klingon.

† - Alphabet, a kind of *character set* where characters correspond directly to sounds in a spoken language.

A character set is a mapping from code points (integers) to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set that maps 21-bit integers (code points in the range U+0000 through U+10FFFF) to abstract characters. The Unicode Consortium's glossary describes it thus:

Unicode

  1. The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.
  2. A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.
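The character-set side of this can be sketched in Python, whose built-in `ord` and `chr` convert between a character and its Unicode code point. The sample characters are arbitrary choices for illustration; no bytes are involved at this level.

```python
import unicodedata

# Arbitrary sample characters: Latin, Cyrillic, a currency sign, an emoji.
samples = ["A", "я", "€", "😀"]

# Map each character to its code point (an integer).
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    # chr() is the inverse mapping: integer -> character.
    assert chr(cp) == ch
    print(f"U+{cp:04X}", unicodedata.name(ch))

# The largest Unicode code point needs 21 bits.
max_code_point = 0x10FFFF
bits_needed = max_code_point.bit_length()
```

Note that nothing here says how a code point is stored or transmitted; that is the job of an encoding.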

An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps strings of bytes (8-bit integers) to strings of code points (21-bit integers), and back again. The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629, which says:

The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8
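As a rough Python sketch of the distinction, the same sequence of code points can be turned into different byte strings by different encodings. The sample text is an arbitrary choice, and UTF-16 is included only as a second encoding of the same character set for comparison.

```python
# Arbitrary sample text: one character from each UTF-8 length class.
text = "aя€😀"

# Character-set level: the code points do not depend on any encoding.
code_points = [ord(ch) for ch in text]

# Encoding level: code points -> bytes, differently for each encoding.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark

# UTF-8 uses a variable number of bytes per code point.
utf8_widths = [len(ch.encode("utf-8")) for ch in text]

# Decoding is the reverse mapping: bytes -> code points.
round_trip = utf8_bytes.decode("utf-8")

print(code_points)
print(utf8_bytes.hex(" "))
print(utf16_bytes.hex(" "))
print(utf8_widths)
```

Both byte strings decode back to the same text, which is why "Unicode" names the character set while "UTF-8" and "UTF-16" name encodings of it.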

Mike Samuel answered Oct 23 '22 13:10
