This is my first question on Stack Overflow, and I am delighted to be part of this community because it has helped me many times.
I'm not an expert in SQL or MySQL, but I'm working on a project that needs large tables (a million rows or more). I have a problem with a join and I don't understand why it takes so long. Thanks in advance :)
Here are the tables:
CREATE TABLE IF NOT EXISTS tabla_maestra(
id int UNIQUE,
codigo_alta char(1),
nombre varchar(100),
empresa_apellido1 varchar(150),
apellido2 varchar(50),
tipo_via varchar(20),
nombre_via varchar(100),
numero_via varchar(50),
codigo_via char(5),
codigo_postal char(5),
nombre_poblacion varchar(100),
codigo_ine char(11),
nombre_provincia varchar(50),
telefono varchar(250) UNIQUE,
actividad varchar(100),
estado char(1),
codigo_operadora char(3)
);
CREATE TABLE IF NOT EXISTS tabla_actividades_empresas(
empresa_apellido1 varchar(150),
actividad varchar(100)
);
Here is the query I want to do:
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
ON (tm.nombre!='' AND tae.empresa_apellido1=tm.empresa_apellido1)
SET tm.actividad=tae.actividad;
This query takes too long, so before running it I tried to measure how long this simpler query takes:
SELECT COUNT(*) FROM tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
ON (tm.nombre!='' AND tae.empresa_apellido1=tm.empresa_apellido1);
It is still taking too long, and I don't understand why. Here are the indexes I use:
CREATE INDEX cruce_nombre
USING HASH
ON tabla_maestra (nombre);
CREATE INDEX cruce_empresa_apellido1
USING HASH
ON tabla_maestra (empresa_apellido1);
CREATE INDEX index_actividades_empresas
USING HASH
ON tabla_actividades_empresas(empresa_apellido1);
If I use the EXPLAIN statement, these are the results:
http://oi59.tinypic.com/2zedoy0.jpg
I would be so grateful to receive any answer that could help me. Thanks a lot, Dani.
A join involving half a million rows -- as your query plan shows -- is bound to take some time. The count(*) query is quicker because it doesn't need to read the tabla_maestra table itself, but it still needs to scan all the rows of index cruce_empresa_apellido1.
It might help some if you made index index_actividades_empresas a unique index (supposing that that's indeed appropriate), or if instead you dropped that index and made column empresa_apellido1 the primary key of table tabla_actividades_empresas.
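The two alternatives above could look roughly like the following. This is only a sketch: it assumes every empresa_apellido1 value in tabla_actividades_empresas is distinct and non-NULL, otherwise the statements will fail. Run one option, not both.

```sql
-- Option A: replace the existing index with a unique one.
-- Assumes there are no duplicate empresa_apellido1 values.
DROP INDEX index_actividades_empresas ON tabla_actividades_empresas;
CREATE UNIQUE INDEX index_actividades_empresas
    ON tabla_actividades_empresas (empresa_apellido1);

-- Option B: drop the index and make the column the primary key instead.
-- Assumes the column has no duplicates and no NULLs.
DROP INDEX index_actividades_empresas ON tabla_actividades_empresas;
ALTER TABLE tabla_actividades_empresas
    ADD PRIMARY KEY (empresa_apellido1);
```

Either way the optimizer learns that each row of tabla_maestra matches at most one row of tabla_actividades_empresas. Re-running EXPLAIN afterwards shows whether the plan actually changed.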
If even that does not give you sufficient performance, then the only other thing I see to do is to give table tabla_actividades_empresas a synthetic primary key of integer type, and to change the corresponding column of tabla_maestra to match. That should help because comparing an integer to an integer is faster than comparing a string to a string, even when you can filter out (most) mismatches via a hash.
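A rough sketch of the synthetic-key idea follows. The column names id and actividad_empresa_id and the index name idx_actividad_empresa_id are hypothetical, chosen only for illustration. Here the new integer column is added alongside the existing string column rather than replacing it, and it is populated once with the slow string join.

```sql
-- Give the lookup table an integer surrogate key (hypothetical column name: id).
ALTER TABLE tabla_actividades_empresas
    ADD COLUMN id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;

-- Add a matching integer column to the main table and index it
-- (hypothetical names: actividad_empresa_id, idx_actividad_empresa_id).
ALTER TABLE tabla_maestra
    ADD COLUMN actividad_empresa_id INT UNSIGNED NULL,
    ADD INDEX idx_actividad_empresa_id (actividad_empresa_id);

-- One-time population: this still pays for the string comparison once.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
    ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad_empresa_id = tae.id;

-- Later updates can join on the integer columns only.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
    ON tae.id = tm.actividad_empresa_id
SET tm.actividad = tae.actividad
WHERE tm.nombre != '';
```

The gain only appears on the joins that come after the one-time population, so this is worth it mainly if the update is repeated.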
I agree with the others (see John Bollinger's answer, for example) about the lack of primary keys. A primary key is highly advisable for IDs. (I noticed you are worried about repeated values, but a primary key handles that smoothly too; I mean MySQL's AUTO_INCREMENT.)
Why do you join on tabla_actividades_empresas.empresa_apellido1 instead of referencing tabla_maestra's ID? You could then define a foreign key for it, for example tabla_actividades_empresas.maestra_id, because joins perform better when the tables are related through non-string types.
You can also filter a table in a subquery before joining it. For example:
UPDATE (SELECT * FROM tabla_maestra WHERE nombre != '') AS tm
INNER JOIN tabla_actividades_empresas AS tae
ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad = tae.actividad;
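I have not tested it, and some MySQL versions refuse a derived table as the target of an UPDATE, so the filter may need to go in a WHERE clause instead. Still, filtering before joining seems like a good habit to follow.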
Oh... do you need to update every row each time? If not, you can update only the rows that are missing a value. You could use a LEFT JOIN to determine which rows need updating and then apply the UPDATE with INNER JOIN to just those. Does that make sense? I'm no expert, but it may be worth thinking about.
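One possible reading of "update only the rows that need it" is sketched below. It is an assumption about what counts as needing an update: a row is touched only when its current actividad differs from the value in tabla_actividades_empresas. It uses a WHERE filter on the joined rows rather than a separate LEFT JOIN step.

```sql
-- Touch only rows whose actividad is missing or out of date.
-- <=> is MySQL's NULL-safe equality operator, so rows where
-- tm.actividad is NULL are also picked up.
UPDATE tabla_maestra tm
INNER JOIN tabla_actividades_empresas tae
    ON tae.empresa_apellido1 = tm.empresa_apellido1
SET tm.actividad = tae.actividad
WHERE tm.nombre != ''
  AND NOT (tm.actividad <=> tae.actividad);
```

The join itself still has to be evaluated, so this mostly saves write work; how much it helps depends on how many rows are already up to date.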
EDIT
You could also try the multiple-table UPDATE syntax:
UPDATE tabla_maestra AS main, tabla_actividades_empresas AS aggr
SET main.actividad = aggr.actividad
WHERE main.empresa_apellido1 = aggr.empresa_apellido1
AND main.nombre <> '';
Don't forget to try adjusting the relationship between the tables as well.