
Why does everyone use latin1?

Someone just said utf8 has variable length encoding from 1 to 3 bytes.

So why does everyone still use latin1? If the same thing is stored in utf8 it is also 1 byte, but utf8 has the advantage that it can adapt to a larger character set.

  • Is there a hidden reason everyone uses latin1?
  • What are the disadvantages of using utf8 vs. latin1?
David19801 asked Jan 25 '11


People also ask

What is the difference between UTF-8 and Latin-1?

They are different encodings, sharing only the byte sequences of the ASCII characters. UTF-8 is an encoding of Unicode with all its code points; Latin-1 encodes fewer than 256 characters.

Why do we use encoding in Latin-1?

It is used by most Unix systems as well as Windows. DOS and Mac OS, however, use their own sets. Latin-1 is occasionally, though imprecisely, referred to as Extended ASCII. This is because the first 128 characters of its set are identical to the US ASCII standard.

What is encoding =' Latin-1?

ISO 8859-1 is the ISO standard Latin-1 character set and encoding format. CP1252 is what Microsoft defined as the superset of ISO 8859-1. Thus, there are approximately 27 extra characters that are not included in the standard ISO 8859-1.
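The CP1252/Latin-1 distinction is easy to see in Python, whose standard codecs include both (the codec names `cp1252` and `latin-1` are Python's, not part of the standards themselves):

```python
# Byte 0x80 illustrates the difference: CP1252 maps it to the euro sign,
# while ISO 8859-1 maps it to the C1 control character U+0080.
b = bytes([0x80])
print(b.decode("cp1252"))    # the euro sign
print(repr(b.decode("latin-1")))  # '\x80' (a control character)

# The ASCII range decodes identically under both encodings.
print(b"hello".decode("cp1252") == b"hello".decode("latin-1"))
```

The "extra" CP1252 characters all live in the 0x80–0x9F range, which ISO 8859-1 reserves for control characters.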

What is the difference between UTF-8 and utf8mb4?

The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character.
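A quick Python sketch makes the 3-byte limit concrete: any character outside the Basic Multilingual Plane needs 4 bytes in UTF-8, which MySQL's legacy `utf8` cannot store:

```python
# MySQL's legacy "utf8" stores at most 3 bytes per character, so it can
# only hold the Basic Multilingual Plane (code points up to U+FFFF).
bmp_char = "\u20ac"     # the euro sign, inside the BMP
emoji    = "\U0001F600" # a grinning-face emoji, outside the BMP

print(len(bmp_char.encode("utf-8")))  # 3 -> fits in MySQL utf8
print(len(emoji.encode("utf-8")))     # 4 -> needs utf8mb4
```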


3 Answers

ISO 8859-1 is the (at least de facto) default character encoding of multiple standards like HTTP (at least for textual contents):

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.

The reason that ISO 8859-1 was chosen is probably that it's a superset of US-ASCII, the fundamental character set for internet-based technologies. And since the World Wide Web was invented and developed at CERN in Geneva, Switzerland, that may be why characters of Western European languages were chosen for the 128 remaining code positions.

When the Unicode standard was developed, the character set of ISO 8859-1 was used as the base of the Unicode character set (the Universal Character Set), so that the first 256 characters are identical to those of ISO 8859-1. This was probably done because of the importance of ISO 8859-1 for the Web, as it already was the standard character encoding for many technologies.

Now, to discuss the advantages of ISO 8859-1 as opposed to UTF-8, we need to look at the underlying character sets and the encoding schemes used to encode their characters:

  • ISO 8859-1 contains 256 characters, where the code point of each character is directly mapped onto its binary representation. So 123 (decimal) is encoded as the byte 01111011 (binary).

  • UTF-8 uses a prefixed variable-length encoding scheme where the prefix indicates the word length. UTF-8 encodes the characters of the Universal Character Set, and its encoding scheme covers all 1,112,064 Unicode scalar values (U+0000–U+10FFFF, excluding the surrogate range). The first 128 characters require 1 byte, code points U+0080–U+07FF require 2 bytes, U+0800–U+FFFF require 3 bytes, and U+10000–U+10FFFF require 4 bytes.

So the differences are the range of encodable characters on the one hand and the length of the encoded words on the other.

So the choice of the “right” character encoding depends on your needs: if you only need the characters of ISO 8859-1 (or US-ASCII as a subset of it), use ISO 8859-1, as it requires only one byte for each character, unlike UTF-8, where characters 128–255 require two bytes. And if you need more or other characters than those in ISO 8859-1, use UTF-8.
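The size trade-off above can be checked directly in Python (a minimal sketch using the built-in codecs):

```python
# Byte counts for the same text under both encodings.
for text in ["ASCII only", "caf\u00e9"]:
    print(text, len(text.encode("latin-1")), len(text.encode("utf-8")))
# ASCII-only text is the same size in both; "café" is 4 bytes in
# Latin-1 but 5 in UTF-8, because é (U+00E9) becomes 0xC3 0xA9.

# Characters outside ISO 8859-1 simply cannot be encoded with it:
try:
    "\u20ac".encode("latin-1")          # the euro sign
except UnicodeEncodeError:
    print("not in ISO 8859-1; UTF-8 needs",
          len("\u20ac".encode("utf-8")), "bytes")
```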

Gumbo answered Sep 21 '22


1) Performance reasons. With a constant-length encoding, going to the n-th character of a string is easy. With a variable-length encoding, you have to walk through all the characters from the beginning of the string to find it. The only way to get this property with Unicode is UTF-32 (every character is 4 bytes), but it takes more memory.

2) All characters with diacritics (accents) in Latin-1 are in the 128–255 range of Latin-1, and are therefore encoded with more than one byte in UTF-8.

3) A lot of programmers don't know how to use Unicode.
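Points 1 and 2 can be sketched in Python: per-character byte lengths vary in UTF-8, so the byte offset of the n-th character cannot be computed directly, while a fixed-width encoding (Latin-1 or UTF-32) allows constant-time offset arithmetic:

```python
s = "na\u00efve caf\u00e9"  # "naïve café"

# UTF-8: the accented characters take 2 bytes, the rest take 1,
# so the offset of character n depends on everything before it.
print([len(c.encode("utf-8")) for c in s])

# UTF-32 is fixed-width: every character takes exactly 4 bytes,
# so character n always starts at byte offset n * 4.
print(all(len(c.encode("utf-32-le")) == 4 for c in s))
```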

Scharron answered Sep 22 '22


This could be a "reason"

Everyone uses latin1 because everyone else does.

It's really annoying mixing different encodings, so you go with what the rest goes with.

(I'm not saying it's a good reason, but I think it's one some people use.)

Nanne answered Sep 19 '22