 

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF-8, UTF-7, UTF-16, UTF-32, ASCII, and ANSI encodings?

In what way are these helpful for programmers?

asked Mar 31 '09 by web dunia


1 Answer

Going down your list (there's a short code sketch comparing them after it):

  • "Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
  • UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
  • UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
  • UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
  • UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
  • ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
  • ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system", which is obtained via Encoding.Default and is often Windows-1252, but it can be another code page entirely.

There's more on my Unicode page and tips for debugging Unicode problems.

The other big resource, of course, is unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.

answered Sep 27 '22 by Jon Skeet