What are the difficulties inherent in ASCII and Extended ASCII, and how are these difficulties overcome by Unicode?
Can someone explain Unicode compatibility to me?
And what do the terms associated with Unicode mean, such as Planes, Basic Multilingual Plane (BMP), Supplementary Multilingual Plane (SMP), Supplementary Ideographic Plane (SIP), Supplementary Special-purpose Plane (SSP) and Private Use Planes (PUP)?
I find all these terms very confusing.
ASCII was more or less the first widely used character encoding. In the days when a byte was very expensive and 1 MHz was extremely fast, the charset of the ASCII character encoding covered only the characters which appeared on those ancient US typewriters (and which still appear on the average US International keyboard nowadays). This includes the complete Latin alphabet (A-Z, in both lowercase and uppercase), the digits (0-9), punctuation (space, dot, comma, colon, etcetera) and some special characters (the at sign, the hash sign, the dollar sign, etcetera), plus a set of non-printing control characters such as tab and line feed. All those characters fill up the space of 7 bits, half of the room a byte provides, with a total of 128 characters.
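To make the 7-bit limit concrete, here is a minimal Python sketch. The sample strings are arbitrary choices for illustration; the only library calls used are the built-in `ord` and `str.encode`.

```python
# Every ASCII character has a code point below 128, so it fits in 7 bits.
text = "Hello, World! @#$"
fits_in_7_bits = all(ord(ch) < 128 for ch in text)

# Encoding to ASCII yields exactly one byte per character.
ascii_bytes = text.encode("ascii")
print(fits_in_7_bits, len(text), len(ascii_bytes))

# A character outside the ASCII charset cannot be encoded at all.
try:
    "caf\u00e9".encode("ascii")   # U+00E9 is "e" with an acute accent
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```

The failing `encode` call is the core difficulty of ASCII: anything beyond those 128 characters simply has no representation.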
Later the remaining bit of the byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the extra room was used for special characters, such as characters with diacritics and line drawing characters. But because everyone used the extra room in their own way (IBM, Commodore, universities, other organizations, etcetera), the results were not interchangeable. Characters which were originally encoded using encoding X show up as Mojibake when they are decoded using a different encoding Y. Later ISO came up with standard character encoding definitions for 8-bit ASCII extensions, resulting in the well-known ISO 8859 family of character encodings built on top of ASCII, such as ISO 8859-1, which made text much more interchangeable.
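Mojibake is easy to reproduce. The sketch below assumes two 8-bit encodings that ship with Python, ISO 8859-1 (`latin-1`) and the IBM PC code page 437 (`cp437`); they are picked only as examples of "encoding X" and "encoding Y".

```python
original = "caf\u00e9"                 # "cafe" with an acute accent on the e

# Encode with ISO 8859-1: the accented letter becomes the single byte 0xE9.
raw = original.encode("latin-1")

# Decode the very same bytes with a different 8-bit table.
garbled = raw.decode("cp437")

# ascii() escapes non-ASCII characters so the output is safe on any console.
print(ascii(original), ascii(garbled))

# Nothing was lost at the byte level: reversing the wrong step restores the text.
restored = garbled.encode("cp437").decode("latin-1")
print(restored == original)
```

The bytes never change; only the table used to interpret them does, which is why both sides must agree on the encoding.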
8 bits may be enough for the languages using the Latin alphabet, but it is certainly not enough for languages with large character sets such as Chinese and Japanese, let alone enough to cover all the scripts of the world (Hebrew, Cyrillic, Devanagari, Arabic, etcetera) at once. Their users developed their own non-ISO character encodings which were, again, not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally a new character set standard was established to cover all the characters used in the world, so that text is interchangeable everywhere: Unicode. Its first 256 code points are the same as ISO 8859-1 (and therefore its first 128 are the same as ASCII). It provides room for over a million characters (1,114,112 code points), of which only a small fraction has been assigned so far. UTF-8 is one of the character encodings of Unicode; it is variable-width, using one to four bytes per character, and encodes the ASCII characters exactly as ASCII does.
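The compatibility between ASCII, ISO 8859-1, Unicode and UTF-8 can be checked directly. This is a small Python sketch using only built-ins and `sys.maxunicode`; the sample characters are arbitrary.

```python
import sys

# The first 256 Unicode code points coincide with ISO 8859-1.
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))

# Size of the Unicode code space: U+0000 through U+10FFFF.
code_space = sys.maxunicode + 1

# UTF-8 is variable-width: 1 to 4 bytes depending on the code point.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
widths = [len(s.encode("utf-8")) for s in samples]

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")

print(latin1_matches, code_space, widths, ascii_compatible)
```

This backward compatibility is what lets existing ASCII files be read as UTF-8 without any conversion.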
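The Unicode code points are divided into seventeen planes, each providing room for 65536 code points (16 bits). Plane 0 is the Basic Multilingual Plane (BMP), which holds the characters of nearly all modern languages; plane 1 is the Supplementary Multilingual Plane (SMP), plane 2 the Supplementary Ideographic Plane (SIP), plane 14 the Supplementary Special-purpose Plane (SSP), and planes 15 and 16 are the Private Use Planes.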
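Since each plane spans 65536 (2^16) code points, the plane of a character is just its code point shifted right by 16 bits. A minimal Python sketch follows; `plane_of` is a hypothetical helper name, not a standard library function.

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number (0-16) of a single character."""
    return ord(ch) >> 16          # same as ord(ch) // 65536

total_planes = (0x10FFFF >> 16) + 1   # highest code point lies in plane 16

bmp_plane = plane_of("A")             # U+0041, Basic Multilingual Plane
smp_plane = plane_of("\U0001F600")    # U+1F600, an emoji in the SMP
sip_plane = plane_of("\U00020000")    # U+20000, a CJK ideograph in the SIP
ssp_plane = plane_of("\U000E0001")    # U+E0001, a tag character in the SSP
pup_plane = plane_of("\U0010FFFF")    # U+10FFFF, last code point, private use

print(total_planes, bmp_plane, smp_plane, sip_plane, ssp_plane, pup_plane)
```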
Usually you will only be interested in the BMP, and you should use UTF-8 as the standard character encoding throughout your entire application.