Could anyone give me concise definitions of

Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from Ascii/Ansi/Windows 1252

I'm not after Wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.


Dummy's guide to Unicode

1 Answer

If you want a really brief introduction: Unicode in 5 Minutes

Or if you are after one-liners:

Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111 (0x10FFFF); covers pretty much all written languages in use
UTF-7: an encoding of code points into a byte stream with the high bit clear; in general, do not use it
UTF-8: an encoding of code points into a byte stream where each code point takes one, two, three or four bytes; should be your primary choice of encoding
UTF-16: an encoding of code points into a stream of 16-bit units where each code point takes one or two units (two or four bytes)
UTF-32: an encoding of code points into a stream of 32-bit units where each code point takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different from the Unicode assignments
ASCII: a very common assignment of 128 characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body; on Windows, "ANSI" is also loosely (and inaccurately) used to mean the system's default codepage, such as Windows 1252
Windows 1252: a commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same (it assigns printable characters such as € to most of the bytes 0x80 through 0x9F, where ISO-8859-1 has control codes), and the two are often confused
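
To make the one-liners above concrete, here is a minimal Python sketch. It relies only on codecs from the standard library; the helper names `encoded_lengths` and `high_bit_clear` and the sample characters are my own choices for illustration, not part of the original answer.

```python
def encoded_lengths(ch):
    """Return how many bytes one character needs in each UTF encoding."""
    # The "-le" codecs are used so that no byte order mark is prepended.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


def high_bit_clear(data):
    """True if every byte in data is below 0x80."""
    return all(b < 0x80 for b in data)


# One character from each UTF-8 length class (1, 2, 3 and 4 bytes).
for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# UTF-7 output never has the high bit set, even for non-ASCII input.
print(high_bit_clear("é".encode("utf-7")))

# ASCII text is byte-for-byte identical when encoded as UTF-8.
print("hello".encode("ascii") == "hello".encode("utf-8"))

# Windows 1252 and Latin-1 disagree on byte 0x80:
# the euro sign in one, a C1 control character in the other.
print(repr(b"\x80".decode("cp1252")), repr(b"\x80".decode("latin-1")))
```

The loop shows UTF-8 growing from one to four bytes as the code point gets larger, UTF-16 switching from two to four bytes only for the emoji (which lies outside the first 65,536 code points), and UTF-32 staying at four bytes throughout.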

Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode

Þ (LATIN CAPITAL LETTER THORN), in ISO-8859-1 and Windows 1252
ﬁ (LATIN SMALL LIGATURE FI), in Mac OS Roman
ή (GREEK SMALL LETTER ETA WITH TONOS), in ISO-8859-7
or 13 other characters, depending on the encoding and character set used.
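
The 0xDE example is easy to reproduce. The sketch below decodes that single byte with three of Python's standard codecs; the helper `decode_all` and the particular codec list are assumptions of mine, chosen to match the three characters listed above.

```python
import unicodedata

# Standard-library codec names for Latin-1, Mac OS Roman and ISO-8859-7.
CODECS = ["latin-1", "mac_roman", "iso8859_7"]


def decode_all(data, codecs=CODECS):
    """Decode the same bytes under several codecs and return the results."""
    return {name: data.decode(name) for name in codecs}


raw = b"\xde"
for name, ch in decode_all(raw).items():
    print(f"{name:10} U+{ord(ch):04X} {unicodedata.name(ch)}")

# On its own, 0xDE is not valid UTF-8: it starts a two-byte sequence
# whose second byte is missing, so decoding fails.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8:", err.reason)
```

The same byte yields three different characters, and under UTF-8 it is not a complete character at all, which is why a byte stream is meaningless without knowing its encoding.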

177

answered Sep 28 '22 08:09

MtnViewMark


Tags:

unicode

utf-8

utf-16

codepages

Arec Barrwin
