Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow MySQL Query - Cache the data in a PHP array?

Tags:

arrays

php

mysql

I need to select some data from MySQL DB using PHP. It can be done within one single MySQL query which takes 5 minutes to run on a good server (multiple JOINs on tables with more that 10 Mio rows).

I was wondering if it is a better practice to split the query in PHP and use some loops, rather than MySQL. Also, would it be better to query all the emails from one table with 150 000 rows in an array and then check the array instead of doing thousands of MySQL SELECTs.

Here is the Query:

SELECT count(contacted_emails.id), contacted_emails.email 
FROM contacted_emails
LEFT OUTER JOIN blacklist ON contacted_emails.email = blacklist.email
LEFT OUTER JOIN submission_authors ON contacted_emails.email = submission_authors.email
LEFT OUTER JOIN users ON contacted_emails.email = users.email
GROUP BY contacted_emails.email
HAVING count(contacted_emails.id) > 3

The EXPLAIN returns: EXPLAIN

The indexes in the 4 tables are:

contacted_emails: id, blacklist_section_id, journal_id and mail
blacklist: id, email and name
submission_authors: id, hash_key and email
users: id, email, firstname, lastname, editor_id, title_id, country_id, workplace_id

jobtype_id

The table contacted_emails is created like:

CREATE TABLE contacted_emails ( 
  id int(10) unsigned NOT NULL AUTO_INCREMENT, 
  email varchar(150) COLLATE utf8_unicode_ci NOT NULL,
  contacted_at datetime NOT NULL, 
  created_at datetime NOT NULL, 
  blacklist_section_id int(11) unsigned NOT NULL,
  journal_id int(10) DEFAULT NULL, 
  PRIMARY KEY (id), 
  KEY blacklist_section_id (blacklist_section_id), 
  KEY journal_id (journal_id), 
  KEY email (email) ) 
ENGINE=InnoDB AUTO_INCREMENT=4491706 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
like image 803
Milos Cuculovic Avatar asked Aug 21 '15 15:08

Milos Cuculovic


2 Answers

Your indexes look fine.

The performance problems seem to come from the fact that you're JOINing all rows, then filtering using HAVING.

This would probably work better instead:

SELECT * 
FROM (
    SELECT email, COUNT(id) AS number_of_contacts
    FROM contacted_emails
    GROUP BY email
    HAVING COUNT(id) > 3
) AS ce
LEFT OUTER JOIN blacklist AS bl ON ce.email = bl.email
LEFT OUTER JOIN submission_authors AS sa ON ce.email = sa.email
LEFT OUTER JOIN users AS u ON ce.email = u.email
/* EDIT: Exclude-join clause added based on comments below */
WHERE bl.email IS NULL
    AND sa.email IS NULL
    AND u.email IS NULL

Here you're limiting your initial GROUPed data set before the JOINs, which is significantly more optimal.

Although given the context of your original query, the LEFT OUTER JOIN tables dom't seem to be used at all, so the below would probably return the exact same results with even less overhead:

SELECT email, COUNT(id) AS number_of_contacts
FROM contacted_emails
GROUP BY email
HAVING count(id) > 3

What exactly is the point of those JOINed tables? the LEFT JOIN prevents them from reducing the data any, and you're only looking at the aggregate data from contacted_emails. Did you mean to use INNER JOIN instead?


EDIT: You mentioned that the point of the joins is to exclude emails in your existing tables. I modified my first query to do a proper exclude join (this was a bug in your originally posted code).

Here's another possible option that may perform well for you:

SELECT 
FROM contacted_emails
LEFT JOIN (
    SELECT email FROM blacklist
    UNION ALL SELECT email FROM submission_authors
    UNION ALL SELECT email FROM users
) AS existing ON contacted_emails.email = existing.email
WHERE existing.email IS NULL
GROUP BY contacted_emails.email
HAVING COUNT(id) > 3

What I'm doing here is gathering the existing emails in a subquery and doing a single exclude join on that derived table.

Another way you may try to express this is as a non-correlated subquery in the WHERE clause:

SELECT 
FROM contacted_emails
WHERE email NOT IN (
    SELECT email FROM blacklist
    UNION ALL SELECT email FROM submission_authors
    UNION ALL SELECT email FROM users
)
GROUP BY email
HAVING COUNT(id) > 3

Try them all and see which gives the best execution plan in MySQL

like image 110
Steven Moseley Avatar answered Nov 09 '22 07:11

Steven Moseley


A couple of thoughts, in terms of the query you may find it faster if you

count(*) row_count 

and change the HAVING to

row_count > 3

as this can be satisfied from the contacted_emails.email index without having to access the row to get the contacted_emails.id. As both fields are NOT NULL and contacted_emails is the base table this should be the same logic.

As this query will only lengthen as you collect more data, I would suggest a summary table where you store the counts (possibly per some time unit). This can either be update periodically with a cronjob or on the fly with triggers and/or application logic.

If you use a per time unit option on created_at and/or store the last update to the cron, you should be able to get live results by pulling in and appending the latest data.

Any cache solution would have to be adjusted anyway to stay live and the full query run every time the data is cleared/updated.

As suggested in the comments, the database is built for aggregating large amounts of data.. PHP isn't.

like image 21
Arth Avatar answered Nov 09 '22 05:11

Arth