SQL magic - query shouldn't take 15 hours, but it does

OK, so I have one really monstrous MySQL table (900k records, 180 MB total), and I want to extract from subgroups the records with the highest date_updated and calculate a weighted average within each group. The calculation runs for ~15 hours, and I have a strong feeling I'm doing it wrong.

First, monstrous table layout:

  • category
  • element_id
  • date_updated
  • value
  • weight
  • source_prefix
  • source_name

The only key here is on element_id (BTREE, ~8k unique elements).

And calculation process:

Make a hash for each group and subgroup:

CREATE TEMPORARY TABLE `temp1` (INDEX ( `subcat_hash` ))
                SELECT `category`, 
                `element_id`, 
                `source_prefix`, 
                `source_name`, 
                `date_updated`, 
                `value`, 
                `weight`, 
                MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`, 
                MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash` 
                FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28';

I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, I presume.

Find the maximum date for each subgroup:

CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
                SELECT MAX(`date_updated`) AS `maxdate`, `subcat_hash`
                FROM `temp1`
                GROUP BY `subcat_hash`;

Join temp1 with temp2 to find the weighted average values for each category:

CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
            SELECT `temp1`.`element_id`, 
                   `temp1`.`category`, 
                   `temp1`.`source_prefix`, 
                   `temp1`.`source_name`, 
                   `temp1`.`date_updated`, 
                   AVG(`temp1`.`value`) AS `avg_value`,
            SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`

            FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
            WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
            AND `temp1`.`date_updated` = `temp2`.`maxdate`

            GROUP BY `temp1`.`cat_hash`;

(Now that I've looked through it and written it all down, it seems to me that I should use an INNER JOIN in that last query, to avoid the 900k*900k temp table.)
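
For reference, here's how that last query would look rewritten with an INNER JOIN (an untested sketch, same columns and temp tables as above):

CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
            SELECT `temp1`.`element_id`, 
                   `temp1`.`category`, 
                   `temp1`.`source_prefix`, 
                   `temp1`.`source_name`, 
                   `temp1`.`date_updated`, 
                   AVG(`temp1`.`value`) AS `avg_value`,
                   SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`temp1`.`weight`) AS `rating`
            FROM `temp1`
            INNER JOIN `temp2`
                ON `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
                AND `temp2`.`maxdate` = `temp1`.`date_updated`
            GROUP BY `temp1`.`cat_hash`;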

Still, is there a normal way to do this?

UPD: there was a picture of the table here for reference (dead ImageShack link removed).

UPD: EXPLAIN for the proposed solution:

+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key        | key_len | ref                                                                                  | rows   | filtered | Extra                                        |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
|  1 | SIMPLE      | cur   | ALL  | NULL          | NULL       | NULL    | NULL                                                                                 | 893085 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | next  | ref  | prefix        | prefix     | 1074    | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id |      1 |   100.00 | Using where                                  |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+    
asked May 22 '09 by Kuroki Kaze


2 Answers

Using hashes is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join, and this certainly doesn't look like one of those cases: it's a 900k-row table with some aggregates.

Based on your comment, this query might do what you are looking for:

SELECT cur.source_prefix, 
       cur.source_name, 
       cur.category, 
       cur.element_id,
       MAX(cur.date_updated) AS DateUpdated, 
       AVG(cur.value) AS AvgValue,
       SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
    ON next.date_updated < '2009-05-01'
    AND next.source_prefix = cur.source_prefix 
    AND next.source_name = cur.source_name
    AND next.element_id = cur.element_id
    AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name, 
    cur.category, cur.element_id

The GROUP BY performs the calculations per source+category+element.

The JOIN is there to filter out the old entries: it looks for later entries, and then the WHERE clause filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
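
The EXPLAIN in the question shows a key named prefix being used for exactly this kind of lookup. Assuming the real table is named eev0 as in the query above, such an index could be created along these lines (a sketch; the index name is just illustrative):

CREATE INDEX prefix ON eev0 (source_prefix, source_name, element_id, date_updated);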

There are many ways of filtering out old entries, but this one tends to perform reasonably well.
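
For comparison, one of those other ways is a correlated NOT EXISTS subquery. An untested sketch of the equivalent query, under the same assumptions (table eev0, same cutoff date):

SELECT cur.source_prefix, 
       cur.source_name, 
       cur.category, 
       cur.element_id,
       MAX(cur.date_updated) AS DateUpdated, 
       AVG(cur.value) AS AvgValue,
       SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
WHERE cur.date_updated < '2009-05-01'
AND NOT EXISTS (SELECT 1
                FROM eev0 next
                WHERE next.source_prefix = cur.source_prefix
                AND next.source_name = cur.source_name
                AND next.element_id = cur.element_id
                AND next.date_updated > cur.date_updated
                AND next.date_updated < '2009-05-01')
GROUP BY cur.source_prefix, cur.source_name, 
    cur.category, cur.element_id;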

answered by Andomar


OK, so 900k rows isn't a massive table. It's reasonably big, but your queries really shouldn't be taking that long.

First things first: which of the three statements above is taking the most time?
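
If you're not sure, MySQL's query profiler will tell you (available from MySQL 5.0.37 onwards; this is the standard usage, nothing specific to your schema):

SET profiling = 1;
-- ...run the three CREATE TEMPORARY TABLE statements here...
SHOW PROFILES;  -- lists each statement with its duration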

The first problem I see is with your first query: your WHERE clause doesn't include an indexed column, which means it has to do a full scan of the entire table.

Create an index on the date_updated column, then run the query again and see what that does for you.
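
Something along these lines (assuming the table is named bigbigtable as in the question; the index name is just illustrative):

CREATE INDEX `idx_date_updated` ON `bigbigtable` (`date_updated`);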

If you don't need the hashes and are only using them to avail of the dark magic, then remove them completely.

Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.

Edit: My SQL is a little rusty, but are you joining twice in the third SQL statement? Maybe it won't make a difference, but shouldn't it be:

SELECT temp1.element_id, 
   temp1.category, 
   temp1.source_prefix, 
   temp1.source_name, 
   temp1.date_updated, 
   AVG(temp1.value) AS avg_value,
   SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;

or

SELECT temp1.element_id, 
   temp1.category, 
   temp1.source_prefix, 
   temp1.source_name, 
   temp1.date_updated, 
   AVG(temp1.value) AS avg_value,
   SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1, temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
answered by Glen