The Latin-1 characters in the range 128-255 are not valid within a UTF-8 context. Although they do share the same character codes, in UTF-8 they are represented differently.
Latin-1, also called ISO-8859-1, is an 8-bit character set endorsed by the International Organization for Standardization (ISO) and represents the alphabets of Western European languages.
Most libraries that don't hold a lot of foreign language materials will be perfectly fine with ISO8859-1 ( also called Latin-1 or extended ASCII) encoding format, but if you do have a lot of foreign language materials you should choose UTF-8 since that provides access to a lot more foreign characters.
UTF-8 is prepared for world domination, Latin1 isn't.
If you're trying to store non-Latin characters like Chinese, Japanese, Hebrew, Russian, etc using Latin1 encoding, then they will end up as mojibake. You may find the introductory text of this article useful (and even more if you know a bit Java).
Note that full 4-byte UTF-8 support was only introduced in MySQL 5.5. Before that version, it only goes up to 3 bytes per character, not 4 bytes per character. So, it supported only the BMP plane and not e.g. the Emoji plane. If you want full 4-byte UTF-8 support, upgrade MySQL to at least 5.5 or go for another RDBMS like PostgreSQL. In MySQL 5.5+ it's called utf8mb4
.
In latin1 each character is exactly one byte long. In utf8 a character can consist of more than one byte. Consequently utf8 has more characters than latin1 (and the characters they do have in common aren't necessarily represented by the same byte/bytesequence).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With