Databases: column encoding, when is it important?

We are importing UTF-8 encoded data from a .sql script into a MySQL database:

mysql ... database_name < script.sql

Later this data is displayed on a page in our web application (connected to that database), again in UTF-8. But something went wrong somewhere in the process, because non-ASCII characters were displayed incorrectly.

Our first attempt to solve it was to change the MySQL column encoding to UTF-8 (as described, for example, here):

alter table wp_posts change post_content post_content LONGBLOB;
alter table wp_posts change post_content post_content LONGTEXT CHARACTER SET utf8;

But it didn't help.

Finally we solved the problem by importing the data from the .sql script with an additional command-line flag which, I believe, forced the mysql client to treat the data from the .sql script as UTF-8:

mysql ... --default-character-set=utf8 database_name < script.sql

It helped, but then we realized that this time we had forgotten to change the column encoding to utf8: it was set to latin1 even though UTF-8 encoded data was flowing through the database (from the SQL script to the application).

So if data obtained from the database is displayed correctly even when the database character set is set incorrectly, then why the heck should I bother setting the correct database encoding?

In particular, I would like to know:

  1. What parts of the database rely on the column encoding setting? When does this setting have any real meaning?
  2. On what occasions is an implicit conversion of column encoding done?
  3. How does the trick of converting a column to a binary format and then to the destination encoding work (see the SQL snippet above)? I still don't get it.

Hope someone can help me clear things up...

Piotr Sobczyk asked May 24 '26 20:05


1 Answer

The biggest reason, in my view, is that a mismatched encoding breaks the consistency of your database.

  • it happens way too often that you need to check data directly in the database. If you cannot properly enter in your MySQL CLI client the UTF-8 strings that come from the web page, that's a pity;
  • if you need to use phpMyAdmin to administer your database through the “correct” web, then you're limiting yourself (might not be an issue though);
  • if you need to build a report on your data, then your choice of tools is greatly limited, given that only the web application produces correct output;
  • if you need to deliver a partial database extract to a partner or an external company for analysis and the extract is messed up, that's a pity.

Now to your questions:

  1. When you ask the database to ORDER BY a column of a string data type, the sorting rules take the encoding of that column into account, and some internal transformations apply if different columns have different encodings. The same applies when you compare strings: encoding information is essential there. An encoding always comes together with a collation (the set of rules for comparing and sorting strings in that encoding), although most people rarely set it explicitly.

  2. As mentioned, if an expression mixes columns in different encodings, the database implicitly converts the values to a common encoding, which is typically UTF-8 nowadays. Implicit conversion of strings may also be done in the client frameworks/libraries, depending on the encoding of the client's environment. Typically data is recoded into the database's encoding when it is sent to the server and back into the client's encoding when results are delivered.

  3. Binary data has no notion of encoding; it's just a sequence of bytes. So when you convert to binary, you're telling the database to “forget” the encoding while keeping the data unchanged. Later, you convert back to a string type, enforcing the right encoding. This trick helps if you're sure that the data is physically in UTF-8, while by some accident a different encoding was specified.
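
The relabeling described in point 3 can be modeled outside the database. What follows is a minimal Python sketch, not MySQL itself: it assumes the column was declared latin1 while the stored bytes are really UTF-8, the sample string and variable names are made up for illustration, and Python's codecs stand in for MySQL's character sets.

```python
# Model of the LONGBLOB -> LONGTEXT CHARACTER SET utf8 trick.
# Assumption: the column is labeled latin1, but the bytes in it are UTF-8.

# Bytes as they physically sit in the column: valid UTF-8 for "Zażółć".
stored = "Zażółć".encode("utf-8")

# Reading them under the wrong label yields mojibake:
# each byte is treated as a separate latin1 character.
mislabeled = stored.decode("latin-1")

# Step 1 (change to LONGBLOB): drop the label, keep the bytes untouched.
as_blob = mislabeled.encode("latin-1")

# Step 2 (change to LONGTEXT CHARACTER SET utf8): attach the correct
# label to the very same bytes; nothing is converted.
relabeled = as_blob.decode("utf-8")

# For contrast: converting latin1 -> utf8 directly would transcode each
# mojibake character, producing different (double-encoded) bytes.
converted_directly = mislabeled.encode("utf-8")

print(len(mislabeled), len(relabeled))
print(relabeled)
print(converted_directly == stored)
```

The bytes never change between `stored` and `as_blob`; only the interpretation does. That is why the intermediate binary step matters: changing the column straight from latin1 to utf8 would make the server transcode the text, which corresponds to `converted_directly` here.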

Given that you managed to load the data correctly by using --default-character-set=utf8, the problem had something to do with your environment; I suspect it was not set up for UTF-8.
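
To make the difference between the two imports concrete, here is a rough Python model of the conversions on the way in and out. It is only a sketch under stated assumptions: the client is assumed to have defaulted to latin1 on the first import, the column is latin1 in both cases, `store` and `fetch` are hypothetical helpers rather than MySQL APIs, and the `?` substitution is just how this sketch represents a lossy conversion.

```python
def store(script_bytes: bytes, client_charset: str, column_charset: str) -> bytes:
    """Model of an insert: decode using the charset the client declared,
    then encode into the column's charset (unencodable characters -> '?')."""
    text = script_bytes.decode(client_charset)
    return text.encode(column_charset, errors="replace")


def fetch(column_bytes: bytes, column_charset: str, client_charset: str) -> str:
    """Model of a select: convert the column bytes to the client's charset
    and return the text as that client would display it."""
    wire_bytes = column_bytes.decode(column_charset).encode(client_charset)
    return wire_bytes.decode(client_charset)


script = "café".encode("utf-8")  # the .sql script is UTF-8 on disk

# First import: client declared latin1, column is latin1, so the raw
# UTF-8 bytes are stored as if each byte were one latin1 character.
first = store(script, "latin-1", "latin-1")
shown_first = fetch(first, "latin-1", "utf-8")    # web app connects as utf8

# Second import (--default-character-set=utf8): the declared charset
# matches the bytes, so the text is converted properly into latin1.
second = store(script, "utf-8", "latin-1")
shown_second = fetch(second, "latin-1", "utf-8")

# A character that latin1 cannot represent does not survive the column.
lossy = store("ł".encode("utf-8"), "utf-8", "latin-1")

print(shown_first)
print(shown_second)
print(lossy)
```

In this model the second import displays correctly only because every character happens to exist in latin1. Anything outside that repertoire is lost when it is written to the column, which is one practical reason to fix the column encoding as well.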

I think the best practice today would be to:

  • have all your environments UTF-8 ready, including shells;
  • have all your databases default to UTF-8 encoding.

This way you'll have less room for error.

vyegorov answered May 26 '26 11:05


