Dummy's guide to Unicode

Could anyone give me concise definitions of:

  • Unicode
  • UTF7
  • UTF8
  • UTF16
  • UTF32
  • Codepages
  • How they differ from ASCII/ANSI/Windows 1252

I'm not after Wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about, and why you should care as a programmer.

Arec Barrwin asked Sep 21 '09 14:09


1 Answer

If you want a really brief introduction: Unicode in 5 Minutes

Or if you are after one-liners:

  • Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
  • UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
  • UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
  • UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
  • UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
  • Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
  • ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
  • ANSI: a standards body; on Windows the name is also (mis)used for the system's default codepage, such as Windows 1252
  • Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
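
To make the one-liners above concrete, here is a minimal Python sketch. The sample characters and the helper name `encoded_lengths` are my own choices for illustration, not part of any standard; the codec names are the ones Python's standard library ships with.

```python
# Sample characters (chosen for illustration): plain ASCII, an accented
# Latin letter, a symbol from elsewhere in the 16-bit range, and an emoji
# that lies beyond the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_lengths(ch):
    """Byte length of one character in UTF-8, UTF-16 and UTF-32.

    The "-le" codec variants are used so that no byte order mark is
    added to the output.
    """
    return tuple(
        len(ch.encode(codec)) for codec in ("utf-8", "utf-16-le", "utf-32-le")
    )


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# ASCII text yields the same bytes whether encoded as ASCII or as UTF-8.
ascii_matches_utf8 = "Hello".encode("ascii") == "Hello".encode("utf-8")

# Windows 1252 and Latin-1 disagree on bytes 0x80 through 0x9F: 0x80 is the
# euro sign in Windows 1252 but a control character in Latin-1.
euro_cp1252 = b"\x80".decode("cp1252")
ctrl_latin1 = b"\x80".decode("latin-1")
print(ascii_matches_utf8, repr(euro_cp1252), repr(ctrl_latin1))
```

UTF-8 grows from one to four bytes across the samples, UTF-16 takes two bytes until the character falls outside the 16-bit range, and UTF-32 is always four.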

Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode

  • Þ (LATIN CAPITAL LETTER THORN)
  • ﬁ (LATIN SMALL LIGATURE FI)
  • ή (GREEK SMALL LETTER ETA WITH TONOS)
  • or 13 other characters, depending on the encoding and character set used.
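
The three readings of 0xDE listed above can be reproduced with a short Python sketch. The particular codecs are my assumption about which encodings produce each character (the answer does not name them); all of them are codecs from Python's standard library.

```python
import unicodedata

raw = b"\xde"  # the single byte from the example above

# Codecs assumed here for illustration; each maps 0xDE to a different character.
codecs = ["latin-1", "cp1252", "mac_roman", "iso8859_7", "cp1253"]

decoded = {name: raw.decode(name) for name in codecs}
for name, ch in decoded.items():
    print(f"{name:10} U+{ord(ch):04X} {unicodedata.name(ch)}")

# On its own, 0xDE is not valid UTF-8: it announces a two-byte sequence
# whose second byte is missing, so decoding fails instead of guessing.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```

The same byte comes back as thorn, the fi ligature, or eta with tonos depending only on the codec name passed to `decode`, which is why a byte stream without a declared encoding is ambiguous.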
MtnViewMark answered Sep 28 '22 08:09