Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What are the proper scopes/indices to help to make scoped finds more performant through rails?

I have a relatively large 4-deep relational data setup like this:

ClientApplication         has_many => ClientApplicationVersions
ClientApplicationVersions has_many => CloudLogs
CloudLogs                 has_many => Logs

client_applications: (potentially 1,000's of records)
   - ...
   - account_id
   - public_key
   - deleted_at

client_application_versions: (potentially 10,000's of records)
   - ...
   - client_application_id
   - public_key
   - deleted_at

cloud_logs: (potentially 1,000,000's of records)
   - ...
   - client_application_version_id
   - public_key
   - deleted_at

logs: (potentially 1,000,000,000's of records)
   - ...
   - cloud_log_id
   - public_key
   - time_stamp
   - deleted_at


I am still in development so the structure and setup is not set in stone, but I hope it is setup ok. Using Rails 3.2.11 and InnoDB MySQL. The database is currently filled with a small (compared to the eventual db size) set of data (logs only has 500,000 rows) I have 4 scoped queries, 3 of which are problematic, to retrieve logs.

  1. Grab first page of logs, ordered by timestamp, limited by account_id, client_application.public_key, client_application_version.public_key (Over 100 seconds)
  2. Grab first page of logs, ordered by timestamp, limited by account_id, client_application.public_key (Over 100 seconds)
  3. Grab first page of logs, ordered by timestamp, limited by account_id (Over 100 seconds)
  4. Grab first page of logs, ordered by timestamp (~2 seconds)

I am using rails scopes to help make these calls:

  scope :account_id, proc {|account_id| joins(:client_application).where("client_applications.account_id = ?", account_id) }
  scope :client_application_key, proc {|client_application_key| joins(:client_application).where("client_applications.public_key = ?", client_application_key) }
  scope :client_application_version_key, proc {|client_application_version_key| joins(:client_application_version).where("client_application_versions.public_key = ?", client_application_version_key) }

  default_scope order('logs.timestamp DESC')

I have indices on each table on public_key. I have several indices on the logs table including the one that the optimizer prefers to use (index_logs_on_cloud_log_id), but the queries are still taking eons to run.


Here is how I am calling the method in rails console:

Log.account_id(1).client_application_key('p0kZudG0').client_application_version_key('0HgoJRyE').page(1)

... here is what rails turns it into:

SELECT `logs`.* FROM `logs` INNER JOIN `cloud_logs` ON `cloud_logs`.`id` = `logs`.`cloud_log_id` INNER JOIN `client_application_versions` ON `client_application_versions`.`id` = `cloud_logs`.`client_application_version_id` INNER JOIN `client_applications` ON `client_applications`.`id` = `client_application_versions`.`client_application_id` INNER JOIN `cloud_logs` `cloud_logs_logs_join` ON `cloud_logs_logs_join`.`id` = `logs`.`cloud_log_id` INNER JOIN `client_application_versions` `client_application_versions_logs` ON `client_application_versions_logs`.`id` = `cloud_logs_logs_join`.`client_application_version_id` WHERE (logs.deleted_at IS NULL) AND (client_applications.account_id = 1) AND (client_applications.public_key = 'p0kZudG0') AND (client_application_versions.public_key = '0HgoJRyE') ORDER BY logs.timestamp DESC LIMIT 100 OFFSET 0

... and here is the EXPLAIN statement for that query.

+----+-------------+----------------------------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+------------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table                            | type   | possible_keys                                                                                                                                         | key                                               | key_len | ref                                                                    | rows | Extra                                        |
+----+-------------+----------------------------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+------------------------------------------------------------------------+------+----------------------------------------------+
|  1 | SIMPLE      | client_application_versions      | ref    | PRIMARY,index_client_application_versions_on_client_application_id,index_client_application_versions_on_public_key                                    | index_client_application_versions_on_public_key   | 768     | const                                                                  |    1 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | client_applications              | eq_ref | PRIMARY,index_client_applications_on_account_id,index_client_applications_on_public_key                                                               | PRIMARY                                           | 4       | cloudlog_production.client_application_versions.client_application_id  |    1 | Using where                                  |
|  1 | SIMPLE      | cloud_logs                       | ref    | PRIMARY,index_cloud_logs_on_client_application_version_id                                                                                             | index_cloud_logs_on_client_application_version_id | 5       | cloudlog_production.client_application_versions.id                     |  481 | Using where; Using index                     |
|  1 | SIMPLE      | cloud_logs_logs_join             | eq_ref | PRIMARY,index_cloud_logs_on_client_application_version_id                                                                                             | PRIMARY                                           | 4       | cloudlog_production.cloud_logs.id                                      |    1 |                                              |
|  1 | SIMPLE      | client_application_versions_logs | eq_ref | PRIMARY                                                                                                                                               | PRIMARY                                           | 4       | cloudlog_production.cloud_logs_logs_join.client_application_version_id |    1 | Using index                                  |
|  1 | SIMPLE      | logs                             | ref    | index_logs_on_cloud_log_id_and_deleted_at_and_timestamp,index_logs_on_cloud_log_id_and_deleted_at,index_logs_on_cloud_log_id,index_logs_on_deleted_at | index_logs_on_cloud_log_id                        | 5       | cloudlog_production.cloud_logs.id                                      |    4 | Using where                                  |
+----+-------------+----------------------------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+---------+------------------------------------------------------------------------+------+----------------------------------------------+

This question has 3 parts to it:

  1. Can I optimize my DB with additional indices to help these type of join-dependent sort queries become more performant?
  2. Can I optimize the rails code to help this type of find run in a more performant way?
  3. Am I simply approaching this scoped find the wrong way for large datasets?




UPDATE 1/24/12
As suggested by Geoff and J_MCCaffrey in the answers, I have split the query up into 3 different sections to try and isolate the problem. As expected, it is a problem dealing with the largest table. The MYSQL optimizer handles this differently by using different indices, but the delay persists. Here is the EXPLAIN for this approach.

ClientApplication.find_by_account_id_and_public_key(1, 'p0kZudG0').versions.select{|cav| cav.public_key = '0HgoJRyE'}.first.logs.page(2)
  ClientApplication Load (165.9ms)  SELECT `client_applications`.* FROM `client_applications` WHERE `client_applications`.`account_id` = 1 AND `client_applications`.`public_key` = 'p0kZudG0' AND (client_applications.deleted_at IS NULL) ORDER BY client_applications.id LIMIT 1
  ClientApplicationVersion Load (105.1ms)  SELECT `client_application_versions`.* FROM `client_application_versions` WHERE `client_application_versions`.`client_application_id` = 3 AND (client_application_versions.deleted_at IS NULL) ORDER BY client_application_versions.created_at DESC, client_application_versions.id DESC
  Log Load (57295.0ms)  SELECT `logs`.* FROM `logs` INNER JOIN `cloud_logs` ON `logs`.`cloud_log_id` = `cloud_logs`.`id` WHERE `cloud_logs`.`client_application_version_id` = 49 AND (logs.deleted_at IS NULL) AND (cloud_logs.deleted_at IS NULL) ORDER BY logs.timestamp DESC, cloud_logs.received_at DESC LIMIT 100 OFFSET 100
  EXPLAIN (214.5ms)  EXPLAIN SELECT `logs`.* FROM `logs` INNER JOIN `cloud_logs` ON `logs`.`cloud_log_id` = `cloud_logs`.`id` WHERE `cloud_logs`.`client_application_version_id` = 49 AND (logs.deleted_at IS NULL) AND (cloud_logs.deleted_at IS NULL) ORDER BY logs.timestamp DESC, cloud_logs.received_at DESC LIMIT 100 OFFSET 100
EXPLAIN for: SELECT  `logs`.* FROM `logs` INNER JOIN `cloud_logs` ON `logs`.`cloud_log_id` = `cloud_logs`.`id` WHERE `cloud_logs`.`client_application_version_id` = 49 AND (logs.deleted_at IS NULL) AND (cloud_logs.deleted_at IS NULL) ORDER BY logs.timestamp DESC, cloud_logs.received_at DESC LIMIT 100 OFFSET 100
+----+-------------+------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+---------+-----------------------------------+------+-------------------------------------------------------------------------------------------------------------------------------------------------+
| id | select_type | table      | type        | possible_keys                                                                                                                                         | key                                                                              | key_len | ref                               | rows | Extra                                                                                                                                           |
+----+-------------+------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+---------+-----------------------------------+------+-------------------------------------------------------------------------------------------------------------------------------------------------+
|  1 | SIMPLE      | cloud_logs | index_merge | PRIMARY,index_cloud_logs_on_client_application_version_id,index_cloud_logs_on_deleted_at                                                              | index_cloud_logs_on_client_application_version_id,index_cloud_logs_on_deleted_at | 5,9     | NULL                              | 1874 | Using intersect(index_cloud_logs_on_client_application_version_id,index_cloud_logs_on_deleted_at); Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | logs       | ref         | index_logs_on_cloud_log_id_and_deleted_at_and_timestamp,index_logs_on_cloud_log_id_and_deleted_at,index_logs_on_cloud_log_id,index_logs_on_deleted_at | index_logs_on_cloud_log_id                                                       | 5       | cloudlog_production.cloud_logs.id |    4 | Using where                                                                                                                                     |
+----+-------------+------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------+---------+-----------------------------------+------+-------------------------------------------------------------------------------------------------------------------------------------------------+




UPDATE 1/25/12
Here are the indices for all relevant tables:

CLIENT_APPLICATIONS:
  PRIMARY KEY  (`id`),
  UNIQUE KEY `index_client_applications_on_key` (`key`),
  KEY `index_client_applications_on_account_id` (`account_id`),
  KEY `index_client_applications_on_deleted_at` (`deleted_at`),
  KEY `index_client_applications_on_public_key` (`public_key`)

CLIENT_APPLICATION_VERSIONS:
  PRIMARY KEY  (`id`),
  KEY `index_client_application_versions_on_client_application_id` (`client_application_id`),
  KEY `index_client_application_versions_on_deleted_at` (`deleted_at`),
  KEY `index_client_application_versions_on_public_key` (`public_key`)

CLOUD_LOGS:
  PRIMARY KEY  (`id`),
  KEY `index_cloud_logs_on_api_client_version_id` (`api_client_version_id`),
  KEY `index_cloud_logs_on_client_application_version_id` (`client_application_version_id`),
  KEY `index_cloud_logs_on_deleted_at` (`deleted_at`),
  KEY `index_cloud_logs_on_device_id` (`device_id`),
  KEY `index_cloud_logs_on_public_key` (`public_key`),
  KEY `index_cloud_logs_on_received_at` (`received_at`)

LOGS:
  PRIMARY KEY  (`id`),
  KEY `index_logs_on_class_name` (`class_name`),
  KEY `index_logs_on_cloud_log_id_and_deleted_at_and_timestamp` (`cloud_log_id`,`deleted_at`,`timestamp`),
  KEY `index_logs_on_cloud_log_id_and_deleted_at` (`cloud_log_id`,`deleted_at`),
  KEY `index_logs_on_cloud_log_id` (`cloud_log_id`),
  KEY `index_logs_on_deleted_at` (`deleted_at`),
  KEY `index_logs_on_file_name` (`file_name`),
  KEY `index_logs_on_method_name` (`method_name`),
  KEY `index_logs_on_public_key` (`public_key`),
  KEY `index_logs_on_timestamp` USING BTREE (`timestamp`)
like image 980
coneybeare Avatar asked Jan 19 '13 19:01

coneybeare


1 Answers

Displaying it with a better structure, the query looks like this (already re-arranged)

SELECT
  `logs`.*
FROM
  `logs` as l
  INNER JOIN `cloud_logs` as cl1
    ON
      cl1.id = l.cloud_log_id
  INNER JOIN `cloud_logs` as cl2
    ON
      cl2.id = l.cloud_log_id
  INNER JOIN `client_application_versions` as cav1
    ON
      cav1.id = cl1.client_application_version_id
  INNER JOIN `client_application_versions` as cav2
    ON
      cav2.id = cl2.client_application_version_id
  INNER JOIN `client_applications` as ca
    ON
      ca.id = cav1.client_application_id
WHERE
  (l.deleted_at IS NULL)
    AND
  (ca.account_id = 1)
    AND
  (ca.public_key = 'p0kZudG0')
    AND
  (cav.public_key = '0HgoJRyE')
ORDER BY
  logs.timestamp DESC
LIMIT
  0, 100

When looking at cav1/cl1 and cav2/cl2 one can see, that cav2 and cl2 are never used. There is no filter applied to them apart from the ON statement.

Therefore cav1 is linked to the correct accoutn, cav2 is not linked to any account, and rather contains all matching accounts. This is no problem for the result of the query, but for the size of the join buffer.

Removing the joins (and ON-parts of them) yields:

SELECT
  `logs`.*
FROM
  `logs` as l
  INNER JOIN `cloud_logs` as cl1
    ON
      cl1.id = l.cloud_log_id
--  INNER JOIN `cloud_logs` as cl2
--    ON
--      cl2.id = l.cloud_log_id
  INNER JOIN `client_application_versions` as cav1 use index for join (`index_cavs_on_client_application_id_and_public_key`)
    ON
      cav1.id = cl1.client_application_version_id
        AND
      cav1.public_key = '0HgoJRyE'

--  INNER JOIN `client_application_versions` as cav2
--    ON
--      cav2.id = cl2.client_application_version_id
  INNER JOIN `client_applications` as ca
    ON
      ca.id = cav1.client_application_id
WHERE
  (l.deleted_at IS NULL)
    AND
  (ca.account_id = 1)
    AND
  (ca.public_key = 'p0kZudG0')
ORDER BY
  logs.timestamp DESC
LIMIT
  0, 100

This one should be faster.

Packing this into something useable in console (assuming correct table relations and meta_where):

Log.where(:deleted_at.ne => nil).order("logs.timestamp desc").joins(:cloud_logs) & \
CloudLog.joins(:client_application_versions) & \
ClientApplicationVersion.where(:public_key => '0HgoJRyE').joins(:client_applications) & \
ClientApplication.where(:public_key => 'p0kZudG0', :account_id => 1)

Since i can't reproduce this here, you might have to try for yourself (or add a to_sql to the end) And also add an explain to the shortened query above.

The results might be interesting.

update: after seeing the result and definitions (comments below):

try adding a key:

alter table client_application_versions add key (`client_application_id`, `public_key`);

this should prevent the filesort and speed things up.

edit: updated query to hint mysql about the key.

like image 64
scones Avatar answered Nov 18 '22 01:11

scones