Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare same values stored with different encodings

Tags:

php

mysql

This question is not a duplicate of PHP string comparison between two different types of encoding because my question requires a SQL solution, not a PHP solution.


Context ► There's a museum with two databases with the same charset and collation (engine=INNODB charset=utf8 collate=utf8_unicode_ci) used by two different PHP systems. Each PHP system stores the same data in a different way, next image is an example :

enter image description here

There are tons of data already stored that way and both systems are working fine, so I can't change the PHP encoding or the databases'. One system handles the sales from the box office, the other handles the sales from the website.

The problem ► I need to compare the right column (tipo_boleto_tipo) to the left column (tipo) in order to get the value in another column of the left table (unseen in image), but I'm getting no results because the same values are stored different, for example, when I search for "Niños" it is not found because it was stored as "Niños" ("children" in spanish). I tried to do it via PHP by using utf8_encode and utf8_decode but it is unacceptably slow, so I think it's better to do it with SQL only. This data will be used for a unified report of sales (box office and internet) in variable periods of time and it has to compare hundreds of thousands of rows, that's why it is so slow with PHP.

The question ► Is there anything like utf8_encode or utf8_decode in MYSQL that allows me to match the equivalent values of both columns? Any other suggestion will be welcome.

Next is my current code (with no results) :

            DATABASE    TABLE      COLUMN
               ▼          ▼          ▼
    SELECT boleteria.tipos_boletos.genero            ◄ DESIRED COLUMN.
      FROM boleteria.tipos_boletos                   ◄ DATABASE WITH WEIRD CHARS.
INNER JOIN venta_en_linea.ventas_detalle             ◄ DATABASE WITH PROPER CHARS.
        ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo
     WHERE venta_en_linea.ventas_detalle.evento_id='1'
       AND venta_en_linea.ventas_detalle.tipo_boleto_tipo = 'Niños'

The line ON venta_en_linea.ventas_detalle.tipo_boleto_tipo = boleteria.tipos_boletos.tipo is never gonna work because both values are different ("Niños" vs "Niños").

like image 275
Jose Manuel Abarca Rodríguez Avatar asked Sep 11 '17 16:09

Jose Manuel Abarca Rodríguez


1 Answers

It appears the application which writes to the boleteria database is not storing correct UTF-8. The database column character set refers to how MySQL interprets strings, but your application can still write in other character sets.

I can't tell from your example exactly what the incorrect character set is, but assuming it's Latin-1 you can convert it to latin1 (to make it "correct"), then convert it back to "actual" utf8:

SELECT 1
FROM tipos_boletos, ventas_detalle 
WHERE CONVERT(CAST(CONVERT(tipo USING latin1) AS binary) USING utf8) 
      = tipo_boleto_tipo COLLATE utf8_unicode_ci

I've seen this all too often in PHP applications that weren't written carefully from the start to use UTF-8 strings. If you find the performance too slow and you need to convert frequently, and you don't have an opportunity to update the application writing the data incorrectly, you can add a new column and trigger to the tipos_boletos table and convert on the fly as records are added or edited.

like image 129
Matt S Avatar answered Sep 21 '22 10:09

Matt S