I need to select some data from a MySQL DB using PHP. It can be done within one single MySQL query, which takes 5 minutes to run on a good server (multiple JOINs on tables with more than 10 million rows).
I was wondering if it is better practice to split the query in PHP and use some loops, rather than doing it all in MySQL. Also, would it be better to query all the emails from one table with 150,000 rows into an array and then check against the array, instead of doing thousands of MySQL SELECTs?
Here is the Query:
SELECT count(contacted_emails.id), contacted_emails.email
FROM contacted_emails
LEFT OUTER JOIN blacklist ON contacted_emails.email = blacklist.email
LEFT OUTER JOIN submission_authors ON contacted_emails.email = submission_authors.email
LEFT OUTER JOIN users ON contacted_emails.email = users.email
GROUP BY contacted_emails.email
HAVING count(contacted_emails.id) > 3
The EXPLAIN returns:
The indexes in the 4 tables are:
contacted_emails: id, blacklist_section_id, journal_id and email
blacklist: id, email and name
submission_authors: id, hash_key and email
users: id, email, firstname, lastname, editor_id, title_id, country_id, workplace_id, jobtype_id
The table contacted_emails is created like:
CREATE TABLE contacted_emails (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
  contacted_at datetime NOT NULL,
  created_at datetime NOT NULL,
  blacklist_section_id int(11) unsigned NOT NULL,
  journal_id int(10) DEFAULT NULL,
  PRIMARY KEY (id),
  KEY blacklist_section_id (blacklist_section_id),
  KEY journal_id (journal_id),
  KEY email (email)
) ENGINE=InnoDB AUTO_INCREMENT=4491706 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Your indexes look fine.
The performance problems seem to come from the fact that you're JOINing all rows, then filtering using HAVING.
This would probably work better instead:
SELECT *
FROM (
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING COUNT(id) > 3
) AS ce
LEFT OUTER JOIN blacklist AS bl ON ce.email = bl.email
LEFT OUTER JOIN submission_authors AS sa ON ce.email = sa.email
LEFT OUTER JOIN users AS u ON ce.email = u.email
/* EDIT: Exclude-join clause added based on comments below */
WHERE bl.email IS NULL
AND sa.email IS NULL
AND u.email IS NULL
Here you're limiting your initial GROUPed data set before the JOINs, which is significantly more efficient.
Although, given the context of your original query, the LEFT OUTER JOINed tables don't seem to be used at all, so the below would probably return the exact same results with even less overhead:
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING COUNT(id) > 3
What exactly is the point of those JOINed tables? The LEFT JOIN prevents them from reducing the data at all, and you're only looking at the aggregate data from contacted_emails. Did you mean to use INNER JOIN instead?
EDIT: You mentioned that the point of the joins is to exclude emails in your existing tables. I modified my first query to do a proper exclude join (this was a bug in your originally posted code).
Here's another possible option that may perform well for you:
SELECT contacted_emails.email, COUNT(contacted_emails.id) AS number_of_contacts
FROM contacted_emails
LEFT JOIN (
SELECT email FROM blacklist
UNION ALL SELECT email FROM submission_authors
UNION ALL SELECT email FROM users
) AS existing ON contacted_emails.email = existing.email
WHERE existing.email IS NULL
GROUP BY contacted_emails.email
HAVING COUNT(id) > 3
What I'm doing here is gathering the existing emails in a subquery and doing a single exclude join on that derived table.
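As a rough sanity check of this exclude-join shape (a minimal sketch using SQLite through Python's sqlite3 rather than MySQL, with tiny invented tables and data — not your real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny stand-ins for the real tables (made-up data).
cur.executescript("""
CREATE TABLE contacted_emails (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE blacklist (email TEXT NOT NULL);
CREATE TABLE submission_authors (email TEXT NOT NULL);
CREATE TABLE users (email TEXT NOT NULL);
""")

# a@x.com: contacted 4 times, in no other table -> kept.
# b@x.com: contacted 4 times but blacklisted      -> excluded by the join.
# c@x.com: contacted only 2 times                 -> fails the HAVING.
cur.executemany("INSERT INTO contacted_emails (email) VALUES (?)",
                [("a@x.com",)] * 4 + [("b@x.com",)] * 4 + [("c@x.com",)] * 2)
cur.execute("INSERT INTO blacklist VALUES ('b@x.com')")

rows = cur.execute("""
    SELECT contacted_emails.email, COUNT(contacted_emails.id) AS number_of_contacts
    FROM contacted_emails
    LEFT JOIN (
        SELECT email FROM blacklist
        UNION ALL SELECT email FROM submission_authors
        UNION ALL SELECT email FROM users
    ) AS existing ON contacted_emails.email = existing.email
    WHERE existing.email IS NULL
    GROUP BY contacted_emails.email
    HAVING COUNT(contacted_emails.id) > 3
""").fetchall()

print(rows)  # [('a@x.com', 4)] — only the unmatched, frequently-contacted email survives
```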
Another way you may try to express this is as a non-correlated subquery in the WHERE clause:
SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
WHERE email NOT IN (
SELECT email FROM blacklist
UNION ALL SELECT email FROM submission_authors
UNION ALL SELECT email FROM users
)
GROUP BY email
HAVING COUNT(id) > 3
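One caveat with the NOT IN form: if the subquery ever yields a NULL email, NOT IN matches no rows at all, so it is only safe here because the email columns are declared NOT NULL. A minimal SQLite sketch (invented data, not MySQL) showing it returns the expected result under that condition:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE contacted_emails (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE blacklist (email TEXT NOT NULL);
CREATE TABLE submission_authors (email TEXT NOT NULL);
CREATE TABLE users (email TEXT NOT NULL);
""")
cur.executemany("INSERT INTO contacted_emails (email) VALUES (?)",
                [("a@x.com",)] * 4 + [("b@x.com",)] * 5)
cur.execute("INSERT INTO users VALUES ('b@x.com')")  # b@x.com is an existing user

rows = cur.execute("""
    SELECT email, COUNT(id) AS number_of_contacts
    FROM contacted_emails
    WHERE email NOT IN (
        SELECT email FROM blacklist
        UNION ALL SELECT email FROM submission_authors
        UNION ALL SELECT email FROM users
    )
    GROUP BY email
    HAVING COUNT(id) > 3
""").fetchall()

print(rows)  # [('a@x.com', 4)] — b@x.com is filtered out by NOT IN
```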
Try them all and see which gives the best execution plan in MySQL.
A couple of thoughts. In terms of the query, you may find it faster if you use COUNT(*) AS row_count and change the HAVING clause to row_count > 3, as this can be satisfied from the contacted_emails.email index without having to access the row to get contacted_emails.id. As both fields are NOT NULL and contacted_emails is the base table, this should be the same logic.
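To illustrate why the two forms agree (a small SQLite sketch with invented data; MySQL's covering-index benefit itself can't be shown here, only the equivalence of the counts): since id is NOT NULL, COUNT(*) and COUNT(id) count exactly the same rows per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE contacted_emails (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
cur.executemany("INSERT INTO contacted_emails (email) VALUES (?)",
                [("a@x.com",)] * 4 + [("c@x.com",)] * 2)

# Rewritten form: COUNT(*) aliased, alias reused in HAVING.
star = cur.execute("""
    SELECT email, COUNT(*) AS row_count
    FROM contacted_emails
    GROUP BY email
    HAVING row_count > 3
""").fetchall()

# Original form: COUNT(id), which skips NULLs -- but id is NOT NULL here.
by_id = cur.execute("""
    SELECT email, COUNT(id)
    FROM contacted_emails
    GROUP BY email
    HAVING COUNT(id) > 3
""").fetchall()

print(star == by_id)  # True — identical groups and counts
```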
As this query will only take longer as you collect more data, I would suggest a summary table where you store the counts (possibly per some time unit). This can either be updated periodically with a cron job, or on the fly with triggers and/or application logic.
If you use a per-time-unit option on created_at and/or store the last update time for the cron, you should be able to get live results by pulling in and appending the latest data.
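A minimal sketch of the trigger-maintained variant (SQLite through Python's sqlite3; the contact_counts table name and schema are my own invention for illustration, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE contacted_emails (id INTEGER PRIMARY KEY, email TEXT NOT NULL);

-- Hypothetical summary table: one row per email with a running count.
CREATE TABLE contact_counts (email TEXT PRIMARY KEY, number_of_contacts INTEGER NOT NULL);

-- Keep the count current on every insert instead of re-aggregating millions of rows.
CREATE TRIGGER bump_contact_count AFTER INSERT ON contacted_emails
BEGIN
    INSERT OR IGNORE INTO contact_counts (email, number_of_contacts) VALUES (NEW.email, 0);
    UPDATE contact_counts SET number_of_contacts = number_of_contacts + 1
    WHERE email = NEW.email;
END;
""")

cur.executemany("INSERT INTO contacted_emails (email) VALUES (?)",
                [("a@x.com",)] * 4 + [("c@x.com",)] * 2)

# The expensive GROUP BY ... HAVING becomes a cheap lookup on the summary table.
rows = cur.execute(
    "SELECT email, number_of_contacts FROM contact_counts WHERE number_of_contacts > 3"
).fetchall()
print(rows)  # [('a@x.com', 4)]
```

The trade-off is write amplification: every insert into contacted_emails now also touches the summary table, which is usually a fine price for turning a 5-minute aggregation into an indexed lookup.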
Any cache solution would have to be adjusted anyway to stay live, with the full query run every time the data is cleared/updated.
As suggested in the comments, the database is built for aggregating large amounts of data; PHP isn't.