Is UTF-8 an encoding or a character set?

Tags: encoding, unicode, utf-8, character

I thought that the name of the character set was "Unicode" and that "UTF-8" was the name of a particular encoding of the Unicode character set, but I often see the terms "encoding" and "charset" used interchangeably when referring to UTF-8.

For example,

<meta charset="UTF-8">

<?xml version="1.0" encoding="UTF-8" ?>

795

asked Mar 05 '13 15:03

J Smith

1 Answer

Is UTF-8 an encoding or a character set?

UTF-8 is an encoding, and that is the term used in RFC 3629, which defines it and is quoted below.

I often see the terms "encoding" and "charset" used interchangeably

Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that covered only the characters in that alphabet. Thus, the terms "encoding" and "charset" were often conflated, but they mean different things.

Now, though, Unicode is usually the only character set you need to worry about, since it contains characters for most written languages you'll have to deal with, except Klingon.

† - Alphabet, a kind of *character set* where characters correspond directly to sounds in a spoken language.
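
As a rough illustration of the pre-Unicode situation, the sketch below decodes the same byte with two single-alphabet encodings, ISO 8859-5 (Cyrillic) and ISO 8859-7 (Greek). The byte value 0xE4 and the variable names are chosen only for the example; the codec names are the ones shipped with Python's standard library.

```python
# One byte, two legacy encodings, two different characters.
raw = bytes([0xE4])

as_cyrillic = raw.decode("iso8859_5")  # ISO 8859-5 covers Cyrillic
as_greek = raw.decode("iso8859_7")     # ISO 8859-7 covers Greek

print(as_cyrillic)  # ф
print(as_greek)     # δ

# A Cyrillic-only encoding has no byte for a Greek letter.
try:
    "δ".encode("iso8859_5")
    greek_fits_in_cyrillic = True
except UnicodeEncodeError:
    greek_fits_in_cyrillic = False

print(greek_fits_in_cyrillic)  # False

# With Unicode as the character set, one encoding (here UTF-8) handles both.
mixed = "фδ".encode("utf-8")
print(mixed.hex(" "))  # d1 84 ce b4
```

Because each of these legacy encodings was tied to exactly one repertoire of characters, naming the encoding also named the character set, which is probably why the two words blurred together.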

A character set is a mapping from code points (integers) to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set that maps 21-bit integers (code points, U+0000 through U+10FFFF) to characters. The Unicode Consortium's glossary describes it thus:
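
A minimal sketch of that mapping, using Python's built-in `ord`, `chr`, and the standard `unicodedata` module. The sample characters are arbitrary; nothing here involves bytes or an encoding yet.

```python
import unicodedata

# Character -> code point (an integer) and its official Unicode name.
samples = "Aяδ€"
table = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in samples]
for code_point, name in table:
    print(code_point, name)
# U+0041 LATIN CAPITAL LETTER A
# U+044F CYRILLIC SMALL LETTER YA
# U+03B4 GREEK SMALL LETTER DELTA
# U+20AC EURO SIGN

# Code point -> character is the same mapping read the other way.
euro = chr(0x20AC)
print(euro)  # €

# The largest code point, U+10FFFF, needs 21 bits.
max_code_point = 0x10FFFF
print(max_code_point.bit_length())  # 21

# Integers past the end of the code space map to nothing.
try:
    chr(max_code_point + 1)
    beyond_is_valid = True
except ValueError:
    beyond_is_valid = False
print(beyond_is_valid)  # False
```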

Unicode

The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.

A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.

An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps strings of code points (21-bit integers) to strings of bytes (8-bit integers), and back again when decoding. The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629:
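
To make the "strings to strings" idea concrete, here is a sketch of a UTF-8 encoder for a single code point, following the bit layout given in RFC 3629. The function name `utf8_encode_code_point` is hypothetical and exists only for this example; in practice you would call `str.encode("utf-8")`, which the sketch uses as a cross-check.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Hypothetical helper: encode one code point as 1 to 4 UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Outside the code space, or a surrogate: not encodable in UTF-8.
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


text = "A€😀"
code_points = [ord(ch) for ch in text]  # string of code points
encoded = b"".join(utf8_encode_code_point(cp) for cp in code_points)  # string of bytes

print([f"U+{cp:04X}" for cp in code_points])  # ['U+0041', 'U+20AC', 'U+1F600']
print(encoded.hex(" "))                       # 41 e2 82 ac f0 9f 98 80
print(encoded == text.encode("utf-8"))        # True
print(encoded.decode("utf-8") == text)        # True
```

Three code points become eight bytes here, which is one reason that counting bytes, code points, and characters can give three different answers for the same text.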

The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8
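
The quoted passage does not say which compatibility property it has in mind. One that RFC 3629 does describe is that UTF-8 preserves the US-ASCII range, so the sketch below checks that, assuming this is at least part of what the quote refers to. The sample strings are arbitrary.

```python
# Pure ASCII text produces identical bytes in ASCII and UTF-8.
ascii_text = 'encoding="UTF-8"'
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# Non-ASCII characters only ever produce bytes with the high bit set,
# so they cannot be mistaken for ASCII characters such as '/' or '"'.
multi_byte = "é€😀".encode("utf-8")
all_high = all(b >= 0x80 for b in multi_byte)
print(all_high)  # True
```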

127

answered Oct 23 '22 13:10

Mike Samuel
