Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize query for table with hundreds of millions of rows

this feels like a "do my homework for me" kind of question but I'm really stuck here trying to make this query run quickly against a table with many many rows. Here's a SQLFiddle that shows the schema (more or less).

I've played with the indexes, trying to get something that will show all the required columns but haven't had much success. Here's the create:

CREATE TABLE `AuditEvent` (
    `auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
    `eventTime` datetime NOT NULL,
    `target1Id` int(11) DEFAULT NULL,
    `target1Name` varchar(100) DEFAULT NULL,
    `target2Id` int(11) DEFAULT NULL,
    `target2Name` varchar(100) DEFAULT NULL,
    `clientId` int(11) NOT NULL DEFAULT '1',
    `type` int(11) not null,
    PRIMARY KEY (`auditEventId`),
    KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
    KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)

And (a version of) the select:

select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
    and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;

I end up with a 'Using temporary' and 'Using filesort' as well. I tried dropping the count(*) and using select distinct instead, which doesn't cause the 'Using filesort'. This would probably be okay if there was a way to join back to get the counts.

Originally, the decision was made to track the target1Name and target2Name of the targets as they existed when the audit record was created. I need those names as well (the most recent will do).

Currently the query (above, with missing target1Name and target2Name columns) runs in about 5 seconds on ~24million records. Our target is in the hundreds of millions and we'd like the query to continue to perform along those lines (hoping to keep it under 1-2 minutes, but we'd like to have it much better), but my fear is once we hit that larger amount of data it won't (work to simulate additional rows is underway).

I'm not sure of the best strategy to get the additional fields. If I add the columns straight into the select I lose the 'Using index' on the query. I tried a join back to the table, which keeps the 'Using index' but takes around 20 seconds.

I did try changing the eventTime column to an int rather than a datetime but that didn't seem to affect the index use or time.

like image 420
Nick Spacek Avatar asked Oct 23 '12 13:10

Nick Spacek


1 Answers

As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00' which (as it always does) breaks efficient usage of Transactions index (that is index is actually used only for clientId equation and first part of the range condition and the index is not used for grouping).

Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, group eventTime into periods and replace the BETWEEN clause with a period IN (1,2,3,4,5)). But this might become an overhead for your table.

Another solution that you might try is to add another index (probably replace Transactions if it is not used anymore): (clientId, target1Id, type, eventTime), and use the following query:

SELECT
  ae.target1Id,
  ae.type,
  COUNT(
    NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00' 
                            AND '2012-09-30 23:57:00', 0)
  ) as cnt,
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;

That way, you will a) move the range condition to the end, b) allow using the index for the grouping, c) make the index the covering index for the query (that is the query does not need disk IO operations)

UPD1: I am sorry, yesteday I did not carefully read your post and did not notice that your problem is to retrieve target1Name and target2Name. First of all, I am not sure that you correctly understand the meaning of Using index. The absence of Using index does not mean that no index is used for the query, Using index means that the index itself contains enough data to execute a subquery (that is the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them wil not have Using index.

If you question is just how to add those two fields to your query (which you consider fast enough), then just try the following:

SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
  select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
  from AuditEvent ae
  where ae.clientId=4
      and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
  group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;
like image 199
newtover Avatar answered Sep 19 '22 16:09

newtover