
Why do we need both UCS and Unicode character sets? [closed]

Tags:

unicode

ucs

I guess the codepoints of UCS and Unicode are the same, am I right?

In that case, why do we need two standards (UCS and Unicode)?

Lunar Mushrooms asked Jan 14 '12 05:01



2 Answers

UCS and Unicode are not two standards. The Universal Coded Character Set (UCS) is not itself a standard but a character set defined in one, namely ISO/IEC 10646. It should also not be confused with encodings such as UCS-2.

It is difficult to guess whether you actually mean different encodings or different standards. Regarding the latter, Unicode and ISO 10646 were originally two distinct standardization efforts with different goals and strategies. They were, however, harmonized in the early 1990s to avoid the mess that two different standards would have caused, and they have been kept coordinated so that the code points are indeed the same.

They were kept distinct, though, partly because Unicode is defined by an industry consortium that can work flexibly and has great interest in standardizing things beyond simple code point assignments. The Unicode Standard defines a large number of principles and processing rules, not just the characters. ISO 10646 is a formal standard that can be referenced in standards and other documents of the ISO and its members.

Jukka K. Korpela answered Oct 14 '22 18:10


The codepoints are the same but there are some differences. From the Wikipedia entry about the differences between Unicode and ISO 10646 (i.e. UCS):

"The difference between them is that Unicode adds rules and specifications that are outside the scope of ISO 10646. ISO 10646 is a simple character map, an extension of previous standards like ISO 8859. In contrast, Unicode adds rules for collation, normalization of forms, and the bidirectional algorithm for scripts like Hebrew and Arabic."
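To make "rules beyond the character map" a bit more concrete, here is a minimal Python sketch using the standard library's unicodedata module. The sample characters (é and the Hebrew letter alef) are my own picks for illustration and do not come from the quoted text. It shows two things the Unicode Standard specifies on top of the code point assignments: normalization forms and bidirectional character classes.

```python
import unicodedata

# The same visible character "é", written two ways (illustrative choices):
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# As raw code point sequences they differ...
print(composed == decomposed)

# ...but Unicode normalization maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# Each character also carries a bidirectional class, which the
# bidirectional algorithm uses to lay out mixed-direction text.
print(unicodedata.bidirectional("a"))        # left-to-right letter
print(unicodedata.bidirectional("\u05d0"))   # Hebrew alef, right-to-left
```

A bare code chart only tells you that U+00E9 and U+0301 exist; the equivalence between the two spellings and the direction classes are the kind of extra rules the quote refers to.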

You might find it useful to read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

I think the differences also show up in the way the code points are encoded. The UCS-x encodings use a fixed number of bytes per code point: UCS-2 uses two bytes and UCS-4 uses four. As a consequence, UCS-2 cannot encode code points above U+FFFF, which do not fit in two bytes. UTF-8 and UTF-16, on the other hand, use a variable number of bytes. UTF-8 uses one byte for ASCII characters and two to four bytes for characters outside the ASCII range, while UTF-16 uses two bytes for code points up to U+FFFF and four bytes (a surrogate pair) for the rest. UTF-32 is fixed-width again and is equivalent to UCS-4 in practice.
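Here is a small Python sketch of the fixed versus variable width point. Python has no UCS-2 codec, so fits_in_ucs2 is a hypothetical helper of mine that only checks whether a code point is at most U+FFFF; the sample characters are also just illustrative.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes one character takes in each encoding form.

    The "-be" codec variants are used so that no byte order mark
    is added to the output and counted by mistake.
    """
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-be")),
        "UTF-32": len(ch.encode("utf-32-be")),
    }


def fits_in_ucs2(ch: str) -> bool:
    """Hypothetical helper: UCS-2 has exactly two bytes per code point,
    so only code points up to U+FFFF can be represented.
    (Surrogate code points are not excluded here, for brevity.)
    """
    return ord(ch) <= 0xFFFF


# "A", "é", "€" and an emoji outside the two-byte range
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch), "UCS-2 ok:", fits_in_ucs2(ch))
```

The emoji (U+1F600) takes four bytes in all three UTF forms and fails the UCS-2 check, while "A" takes one byte in UTF-8 but four in UTF-32.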

Juuso Ohtonen answered Oct 14 '22 18:10