What's the difference between an "encoding," a "character set," and a "code page"?

1 Answer

A ‘character set’ is just what it says: a properly-specified list of distinct characters.

An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.

UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).
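As a rough illustration of the distinction, the Python sketch below takes one Unicode character and shows that its code point (its place in the character set) stays fixed while the bytes depend on which encoding is applied. The example character and the variable names are arbitrary choices, not part of the original answer.

```python
# One character from the Unicode character set.
ch = "é"

# Its code point is a property of the character set, not of any encoding.
code_point = ord(ch)                 # 0xE9, i.e. U+00E9

# The same character under three different encodings of Unicode.
utf8 = ch.encode("utf-8")            # two bytes
utf16 = ch.encode("utf-16-be")       # one 16-bit unit, big-endian
utf32 = ch.encode("utf-32-be")       # one 32-bit unit, big-endian

print(f"U+{code_point:04X}")         # U+00E9
print(utf8.hex(), utf16.hex(), utf32.hex())   # c3a9 00e9 000000e9
```

The code point never changes; only the byte representation does, which is what separates the character set from its encodings.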

The confusion comes about because most other well-known encodings (e.g. ISO-8859-1) started out as separate character sets. Then, when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.
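Converting "through Unicode" can be sketched in Python as decode-then-encode. The helper name `transcode` is hypothetical, and the choice of ISO-8859-1, code page 437 and ISO-8859-7 is only for illustration; the codec names are the ones Python's standard library uses.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one legacy encoding to another via Unicode."""
    text = data.decode(source)    # legacy bytes -> Unicode characters
    return text.encode(target)    # Unicode characters -> other legacy bytes


latin1_bytes = b"caf\xe9"         # "café" in ISO-8859-1

# é is byte 0xE9 in ISO-8859-1 but byte 0x82 in the old IBM code page 437.
cp437_bytes = transcode(latin1_bytes, "iso8859-1", "cp437")

# These encodings are only partial: ISO-8859-7 (Greek) has no é at all,
# so the same conversion fails at the encode step.
try:
    transcode(latin1_bytes, "iso8859-1", "iso8859-7")
    greek_ok = True
except UnicodeEncodeError:
    greek_ok = False
```

Here `cp437_bytes` comes out as `b"caf\x82"`, and `greek_ok` ends up `False` because the target encoding covers only part of Unicode.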

A ‘code page’ is a term originating at IBM, where the code page selected which set of symbols would be displayed. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows, where it just acts as an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.

When one talks of code page ‹some number›, one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example, code page 28591 would not normally be referred to under that name, but simply as ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.
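The difference between the two can be seen in a small Python sketch. It assumes Python's codec names `cp1252` and `iso8859-1` for these encodings; the byte `0x80` is just one of the positions where they disagree.

```python
EURO_BYTE = b"\x80"

as_cp1252 = EURO_BYTE.decode("cp1252")       # Windows code page 1252
as_latin1 = EURO_BYTE.decode("iso8859-1")    # ISO-8859-1

print(repr(as_cp1252))    # '€'
print(repr(as_latin1))    # '\x80' (a C1 control code, not a printable symbol)

# From 0xA0 upwards the two encodings agree byte for byte.
same_upper_range = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("iso8859-1")
    for b in range(0xA0, 0x100)
)
print(same_upper_range)   # True
```

So text that stays within the shared range is indistinguishable between the two, which is a large part of why they are so often confused.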

[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]
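A short Python sketch of the footnote's point, assuming Python's `shift_jis` and `iso2022_jp` codecs as stand-ins for ‘Shift-JIS’ and ‘JIS’ respectively; the sample text is an arbitrary choice.

```python
text = "日本語"    # three characters from the JIS X 0208 character set

sjis = text.encode("shift_jis")     # high-byte-based encoding
jis = text.encode("iso2022_jp")     # escape-switching encoding

# Shift-JIS marks each double-byte character with a lead byte >= 0x80.
print(len(sjis), sjis[0] >= 0x80)

# ISO-2022-JP stays 7-bit: an escape sequence switches into JIS X 0208
# and another switches back to ASCII at the end.
print(jis.startswith(b"\x1b$B"), jis.endswith(b"\x1b(B"))
print(all(b < 0x80 for b in jis))

# Both byte strings decode back to the same characters.
print(sjis.decode("shift_jis") == jis.decode("iso2022_jp") == text)
```

The characters are identical in both cases; only the byte-level scheme for representing them differs.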

answered Sep 19 '22 00:09

bobince



Tags:

encoding

codepages

Deane
