 

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF-8, UTF-7, UTF-16, UTF-32, ASCII, and ANSI encodings?

In what way are these helpful for programmers?

asked Mar 31 '09 by web dunia


1 Answer

Going down your list (there's a short code sketch comparing them after it):

  • "Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
  • UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
  • UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
  • UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
  • UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
  • ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
  • ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system", which is obtained via Encoding.Default and is often Windows-1252, but it can be another code page entirely.

There's more on my Unicode page and tips for debugging Unicode problems.

The other big resource, of course, is unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.

answered Sep 27 '22 by Jon Skeet