Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF8 versus Latin1

Tags:

php

mysql

utf-8

I am trying to understand the difference between Latin1 and UTF8 and for the most part I get it, however, when testing I am getting some weird results and could use some help clarifying

I am testing with 'é' (Latin small letter E with acute) and the link below shows the hex c3a9

I setup a database and specified utf8 as the character set, then created a table with utf8 as the character set and inserted a record with the character 'é' after setting the connection and client character set to UTF8

when I do a select hex(field), field from test_table I get:

hex(field), field C3A9, é

This is fine and consistent with what I read, however, when I do the exact same using a latin1 character set I get the following:

hex(field), field C3A9, é

but if I enter char(x'E9') which should be the single byte Latin1 equivalent value for é I manage to get it to display correctly using 'set names UTF8' but it doesn't show up correctly when setting the connection and client to Latin1

Can anyone clarify? - shouldn't Latin1 characters be single byte (Hex E9) in both UTF8 and Latin1? or am I completely misunderstanding it all?

Thanks

like image 243
digital_paki Avatar asked Dec 07 '15 19:12

digital_paki


2 Answers

latin1 encoding has only 1-byte codes.

The first 128 codes (7-bits) are mostly identical between latin1 and utf8.

é is beyond the 128; it's 1-byte, 8-bit latin1 hex is E9 (as you observed). For utf8, it takes 2 bytes: C3A9. For most Asian characters, utf8 takes 3 byte; latin1 cannot represent those characters.

MySQL has the confusing command SET NAMES utf8. That announces that the client's encoding is utf8, and instructs the communication between client and server to convert between the column's CHARACTER SET and utf8 when reading/writing.

If you have SET NAMES latin1 (the old default), but the bytes in the client are encoded utf8, then you are 'lying', and various nasty things happen. But there is no immediate clue that something is wrong.

Checklist for going entirely utf8:

  • Bytes in client are utf8-encoded
  • SET NAMES utf8 (or equivalent parameter during connecting to MySQL)
  • CHARACTER SET utf8 on column or table declaration
  • <meta ... UTF-8> in html
like image 83
Rick James Avatar answered Oct 08 '22 19:10

Rick James


After recently putting a website through the ringer UTF-8 wise I think this is a case of viewing UTF-8 data in a latin1 table within a UTF-8 encoded page or terminal.

If you are using a terminal you can check this by looking at the character encoding setting of the terminal (in Ubuntu it's Terminal -> Set Character Encoding). If you are using something like PHPMyAdmin, view the page source and look for the charset of the page, or open up Firebug and look at the response headers for the page, it should say "UTF-8".

If you've inserted the data and it's encoded in UTF-8 and it goes into a latin1 table then the data will still be stored in UTF-8, it's only when you start viewing that data or retrieving that data in a different encoding that you start getting the mangled effect.

I've found it's really crucial that when you are working with character encoding that you get everything the same: the page must have a charset of UTF-8, the upstream into the database must be in UTF-8, the database must have a default charset and storage of UTF-8. As soon as you put a different charset in the mix everything goes crazy.

like image 38
redreddington Avatar answered Oct 08 '22 18:10

redreddington