 

ASCII vs Unicode + UTF-8

Was reading Joel Spolsky's 'The Absolute Minimum' about character encoding. It is my understanding that ASCII is a Code-point + Encoding scheme, and in modern times, we use Unicode as the Code-point scheme and UTF-8 as the Encoding scheme. Is this correct?

asked Jan 23 '14 by Quest Monger


People also ask

Which is better ASCII or UTF-8?

All ASCII characters can be encoded in UTF-8 without any increase in storage (each takes a single byte in both). UTF-8 has the added benefit of supporting characters beyond the ASCII range.
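A quick Python sketch of this point (illustrative only, not part of the original answer):

    ascii_text = "hello"
    print(len(ascii_text.encode("utf-8")))   # 5 -- one byte per ASCII character
    print(len("héllo".encode("utf-8")))      # 6 -- 'é' needs two bytes in UTF-8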

Is Unicode same as UTF-8?

Unicode is a character set; UTF-8 is an encoding. Unicode is a list of characters, each with a unique number (its code point).
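To make the character-set-versus-encoding distinction concrete, a small Python sketch (illustrative only):

    ch = "€"
    print(ord(ch))                 # 8364 -- the Unicode code point, U+20AC
    print(ch.encode("utf-8"))      # b'\xe2\x82\xac' -- its three-byte UTF-8 encoding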

Which is better ASCII or Unicode?

Unicode represents far more characters than ASCII. ASCII uses a 7-bit range to encode just 128 distinct characters, while Unicode covers 154 written scripts.
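As a sketch of the difference (illustrative only): characters outside ASCII's 128 code points still have Unicode code points, but cannot be encoded as ASCII.

    print("A".encode("ascii"))     # b'A' -- within ASCII's 128 characters
    try:
        "α".encode("ascii")        # Greek alpha, U+03B1, has no ASCII code
    except UnicodeEncodeError as e:
        print(e)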

Is US ASCII the same as UTF-8?

ASCII is a subset of UTF-8, so all ASCII files are already valid UTF-8. The bytes in an ASCII file and the bytes that would result from "encoding it to UTF-8" are exactly the same, so no conversion is needed.
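A one-line Python check of that claim (illustrative sketch, not from the original answer):

    text = "plain ASCII text"
    print(text.encode("ascii") == text.encode("utf-8"))   # True -- identical bytes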


2 Answers

In modern usage, ASCII is effectively a subset of UTF-8 rather than a scheme of its own: UTF-8 is backwards compatible with ASCII.
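For illustration (a sketch, not part of the original answer), the backwards compatibility also holds in the decoding direction: bytes produced by an ASCII encoder decode unchanged with a UTF-8 decoder.

    ascii_bytes = "hello world".encode("ascii")
    print(ascii_bytes.decode("utf-8"))   # 'hello world'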

answered Oct 11 '22 by Remy Lebeau


Yes, except that UTF-8 is just one encoding scheme. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (To add to the confusion, a UTF-16 scheme is called "Unicode" in Microsoft software.)
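A small Python sketch of the same character under several encoding schemes (illustrative only):

    ch = "€"                             # U+20AC
    print(ch.encode("utf-8").hex())      # 'e282ac'   -- 3 bytes
    print(ch.encode("utf-16-le").hex())  # 'ac20'     -- 2 bytes, little-endian
    print(ch.encode("utf-16-be").hex())  # '20ac'     -- 2 bytes, big-endian
    print(ch.encode("utf-32-le").hex())  # 'ac200000' -- 4 bytes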

And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. so that five ASCII characters were packed into one 36-bit storage unit, or so that 8-bit bytes used the extra bit for checking purposes (a parity bit) or for transfer control. But nowadays ASCII is used so that one ASCII character is encoded as one 8-bit byte with the most significant bit set to zero. This is the de facto standard encoding scheme and is implied in a large number of specifications, but strictly speaking it is not part of the ASCII standard.
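A quick Python check of that de facto convention (illustrative sketch only): every byte of ASCII-encoded text fits in 7 bits, i.e. the high bit is zero.

    data = "ASCII text".encode("ascii")
    print(all(byte < 0x80 for byte in data))   # True -- one character per byte, high bit zero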

answered Oct 12 '22 by Jukka K. Korpela