
What's the difference between encoding and charset?



Basically:

  1. charset is the set of characters you can use
  2. encoding is the way these characters are stored in memory

Every encoding has a particular charset associated with it, but there can be more than one encoding for a given charset. A charset is simply what it sounds like: a set of characters. There are a large number of charsets, including many that are intended for particular scripts or languages.

However, we are well along in the transition to Unicode, whose character set is capable of representing almost all the world's scripts. Even so, there are multiple encodings for Unicode. An encoding is a way of mapping a string of characters to a string of bytes. Examples of Unicode encodings include UTF-8, UTF-16 BE, and UTF-16 LE. Each of these has advantages for particular applications or machine architectures.
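A quick way to see "one charset, many encodings" in practice is to encode the same Unicode string with several Unicode encodings: the characters stay the same, but the byte sequences differ. A minimal Python sketch (the sample string is an arbitrary choice):

    text = "héllo क"  # one sequence of Unicode characters (the charset side)

    # The same characters, stored as different byte sequences (the encoding side).
    for encoding in ("utf-8", "utf-16-be", "utf-16-le"):
        print(encoding, text.encode(encoding).hex())

    # Round-tripping with the same encoding recovers the original characters.
    assert text.encode("utf-16-be").decode("utf-16-be") == text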


In addition to the other answers, I think this article is a good read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

The essay is from 2003, but (unfortunately) the content is still valid...


A character encoding consists of:

  1. The set of supported characters
  2. A mapping between characters and integers ("code points")
  3. How code points are encoded as a series of "code units" (e.g., 16-bit units for UTF-16)
  4. How code units are encoded into bytes (e.g., big-endian or little-endian)

Step #1 by itself is a "character repertoire" or abstract "character set", and #1 + #2 = a "coded character set".

But back before Unicode became popular and everyone (except East Asians) was using a single-byte encoding, steps #3 and #4 were trivial (code point = code unit = byte). Thus, older protocols didn't clearly distinguish between "character encoding" and "coded character set", and they use charset when they really mean encoding.
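To make the layering concrete, here is a small Python sketch (the two characters are arbitrary examples): ord() gives the code point (step #2), UTF-16 turns that code point into one 16-bit code unit for '€' and a surrogate pair for '𝄞' (step #3), and the code units are then written out in big- or little-endian byte order (step #4):

    for ch in ("€", "𝄞"):            # U+20AC (one code unit), U+1D11E (surrogate pair)
        code_point = ord(ch)          # step #2: character -> code point
        be = ch.encode("utf-16-be")   # steps #3 + #4: code units, big-endian bytes
        le = ch.encode("utf-16-le")   # same code units, little-endian byte order
        # Step #3 made visible: group the big-endian bytes into 16-bit code units.
        units = [be[i:i + 2].hex() for i in range(0, len(be), 2)]
        print(f"U+{code_point:04X}: code units {units}, BE {be.hex()}, LE {le.hex()}")

Running it shows '€' as the single code unit 20ac (bytes 20 ac vs ac 20) and '𝄞' as the surrogate pair d834 dd1e (bytes d8 34 dd 1e vs 34 d8 1e dd).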


Adding some more detail for people who come across this question later; hopefully it will be helpful.


Character Set

Each language has its own characters, and the collection of those characters forms the “character set” of that language. When a character is encoded, it is assigned a unique number called a code point. In a computer, these code points are represented by one or more bytes.

Examples of character sets: ASCII (covers all English characters), ISO/IEC 646, and Unicode (covers characters from all living languages in the world).

Coded Character Set

A coded character set is a set in which a unique number is assigned to each character. That unique number is called a "code point".
Coded character sets are sometimes called code pages.

Encoding

Encoding is the mechanism that maps code points to bytes so that a character can be read and written uniformly across different systems, as long as they use the same encoding scheme.

Examples of encodings: ASCII, and the Unicode encoding schemes UTF-8, UTF-16, and UTF-32.
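As a small illustration of why both sides must agree on the scheme, here is a Python sketch (the sample text is arbitrary): decoding with the same scheme that was used for encoding round-trips cleanly, while decoding the same bytes with a different scheme yields the wrong characters.

    text = "über naïve"
    data = text.encode("utf-8")       # code points -> bytes, using UTF-8

    print(data.decode("utf-8"))       # über naïve   (same scheme: correct)
    print(data.decode("iso-8859-1"))  # Ã¼ber naÃ¯ve (different scheme: garbled)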

Elaboration of the above three concepts

  • Consider this: the character 'क' in the Devanagari character set has a decimal code point of 2325, which is represented by two bytes (09 15) when using the UTF-16 encoding.
  • In the “ISO-8859-1” encoding scheme, “ü” (a character in the Latin character set) is represented by the hexadecimal value FC, while in “UTF-8” it is represented as C3 BC, and in UTF-16 as FE FF 00 FC (the FE FF prefix is the big-endian byte order mark).
  • Different encoding schemes may use the same code point to represent different characters. For example, in “ISO-8859-1” (also called Latin-1) the decimal code point value of the letter ‘é’ is 233, while in ISO-8859-5 the same code point represents the Cyrillic character ‘щ’.
  • On the other hand, a single code point in the Unicode character set can be mapped to different byte sequences, depending on which encoding was used for the document. The Devanagari character क, with code point 2325 (915 in hexadecimal notation), is represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15); the sketch below reproduces these byte values.
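The byte values quoted in the list above can be checked directly in Python; a minimal sketch (the UTF-16 example uses the "utf-16-be" codec, which omits the FE FF byte order mark):

    assert "क".encode("utf-16-be") == bytes.fromhex("0915")
    assert "क".encode("utf-8")     == bytes.fromhex("e0a495")
    assert "क".encode("utf-32-be") == bytes.fromhex("00000915")

    assert "ü".encode("iso-8859-1") == bytes.fromhex("fc")
    assert "ü".encode("utf-8")      == bytes.fromhex("c3bc")

    # The same single-byte value names different characters in different charsets.
    assert bytes([233]).decode("iso-8859-1") == "é"
    assert bytes([233]).decode("iso-8859-5") == "щ"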