
What do I need to know about Unicode? [closed]

As an application developer, do I need to know Unicode?

yesraaj asked Oct 21 '08 15:10



2 Answers

Unicode is a standard that defines numeric codes (code points) for the characters used in written communication. Or, as they say it themselves:

The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium.

There are many common, yet easily avoided, programming errors committed by developers who don't bother to educate themselves about Unicode and its encodings.

  • First, go to the source for authoritative, detailed information and implementation guidelines.
  • As mentioned by others, Joel Spolsky has a good list of these errors.
  • I also like Elliotte Rusty Harold's Ten Commandments of Unicode.
  • Developers should also watch out for canonical representation attacks.
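To make the last bullet concrete, here is a minimal Python sketch of one well-known canonical representation problem: an overlong UTF-8 encoding of `/` (bytes C0 AF) used to sneak a path separator past a filter. This is an assumption about the kind of attack meant, and the function name `decode_path` is hypothetical. The idea shown is to decode strictly first and validate the decoded text afterwards.

```python
def decode_path(raw: bytes) -> str:
    """Decode untrusted bytes strictly, then validate the decoded text."""
    # errors="strict" makes the decoder raise UnicodeDecodeError on
    # malformed input, including overlong (non-shortest-form) sequences.
    text = raw.decode("utf-8", errors="strict")
    # Validate only after decoding, so the check sees the real characters.
    if ".." in text.split("/"):
        raise ValueError("path traversal attempt")
    return text


print(decode_path(b"docs/readme.txt"))

try:
    decode_path(b"..\xc0\xafsecret")  # C0 AF is an overlong form of "/"
except UnicodeDecodeError as exc:
    print("rejected malformed UTF-8:", exc.reason)
```

A lenient decoder that accepted the overlong form would turn those bytes into `../secret` after a byte-level check had already passed, which is why strict decoding comes before validation here.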

Some of the key concepts you should be aware of are:

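  • Glyphs: concrete graphics used to represent written characters. Unicode assigns code points to abstract characters, not to glyphs; the glyph comes from the font.
  • Composition: combining characters (for example, a base letter plus a combining accent) to form what displays as a single character.
  • Encoding: converting Unicode code points to a stream of bytes.
  • Collation: locale-sensitive comparison of Unicode strings.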
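
A short Python sketch may help tie these concepts together. It uses only the standard library (`unicodedata`); the variable names are just for illustration. Note that the final sort is plain code point ordering, shown as a contrast: real locale-sensitive collation needs something like `locale.strxfrm` or an ICU binding, which depend on the system setup and are not shown.

```python
import unicodedata

# Composition: the same visible character, two code point sequences.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)  # False: the raw sequences differ

# Normalization converts between the two forms so they compare equal.
nfc = unicodedata.normalize("NFC", decomposed)  # compose
nfd = unicodedata.normalize("NFD", composed)    # decompose
print(nfc == composed, nfd == decomposed)

# Encoding: code points become bytes, and the bytes depend on the scheme.
utf8_bytes = composed.encode("utf-8")       # b'\xc3\xa9'
utf16_bytes = composed.encode("utf-16-le")  # b'\xe9\x00'
print(utf8_bytes, utf16_bytes)

# Collation contrast: sorted() orders by code point, not by any
# language's alphabetical rules, so "Zebra" sorts before "apple".
by_code_point = sorted(["zebra", "Zebra", "apple", "\u00e9clair"])
print(by_code_point)
```

The practical takeaway is to normalize text to one form before comparing or storing it, and to choose an encoding explicitly at every boundary where text becomes bytes.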

erickson answered Oct 11 '22 20:10


At the risk of just adding another link, unicode.org is a spectacular resource.

In short, it's a replacement for ASCII that's designed to handle the characters of essentially every writing system in use, plus many historical ones. Unicode has several encoding schemes to handle all those characters. UTF-8, which is more or less the standard these days, is a variable-width encoding that uses one to four bytes per code point; the 128 ASCII characters (the 7-bit range) each take a single byte and are encoded exactly as in ASCII.
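
As a rough illustration of UTF-8's variable width, the following Python sketch encodes a few sample characters and reports how many bytes each one takes. The helper name `utf8_len` and the sample characters are arbitrary choices made for this example.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# One sample from each UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, 😀
for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s)")

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
print("hello".encode("ascii") == "hello".encode("utf-8"))
```

One consequence worth remembering is that the byte length of a UTF-8 string is not its character count, so buffer sizes and column widths should not be derived from `len()` of the encoded bytes.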

(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)

Electrons_Ahoy answered Oct 11 '22 22:10