Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can this subquery be made to use the index?

First, apologies in advance for the wall of text. I did read through every similar question/answer I could find, but either the answers didn't seem applicable to my query, or I needed more clarity to understand the underlying problem and solution.

I have a table of file sizes along with related file dates and observation timestamps. All dates are UNIX epoch time integers in seconds:

mysql> describe name_servers;
+-----------------------+------------------+------+-----+---------+----------------+
| Field                 | Type             | Null | Key | Default | Extra          |
+-----------------------+------------------+------+-----+---------+----------------+
| server_name           | varchar(255)     | YES  |     | NULL    |                |
| file_date             | int(10) unsigned | YES  |     | NULL    |                |
| file_size             | int(10) unsigned | YES  |     | NULL    |                |
| time                  | int(10) unsigned | YES  | MUL | NULL    |                |
| poll_id               | int(11)          | NO   | PRI | NULL    | auto_increment |
+-----------------------+------------------+------+-----+---------+----------------+
5 rows in set (0.01 sec)


mysql> show index from name_servers;
+--------------+------------+--------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table        | Non_unique | Key_name                 | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+--------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| name_servers |          0 | PRIMARY                  |            1 | poll_id     | A         |     3523218 |     NULL | NULL   |      | BTREE      |         |               |
| name_servers |          0 | index_time_servername    |            1 | time        | A         |      503316 |     NULL | NULL   | YES  | BTREE      |         |               |
| name_servers |          0 | index_time_servername    |            2 | server_name | A         |     3523218 |     NULL | NULL   | YES  | BTREE      |         |               |
+--------------+------------+--------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.00 sec)

I have to track file size changes to detect if the file shrinks by > 20% in any 48 hour period. Normally I would try to do this with MySQL Window functions, but they are not supported by the MySQL version on my server (5.6.37 -- which I have no control over as the server is not managed by my team). Presently I obtain the current size and maximum size (in the last 48 hours) by means of an outside query that finds the file size in the current row, and an inside subquery that finds the largest file size in the previous 48 hours (172,800 seconds) worth of rows:

mysql> select name_servers_outside.server_name,
    -> name_servers_outside.file_size,
    -> name_servers_outside.file_date,
    -> name_servers_outside.time,
    -> (select max(file_size) from name_servers where time > (name_servers_outside.time - 172800) and server_name = 'example_server') as max_file_size
    -> from name_servers as name_servers_outside
    -> where name_servers_outside.server_name = 'example_server'
    -> and name_servers_outside.time > (UNIX_TIMESTAMP() - 172800)
    -> limit 10;
+-------------------+-------------------+-------------------+------------+-----------------------+
| server_name       | file_size         | file_date         | time       | max_file_size         |
+-------------------+-------------------+-------------------+------------+-----------------------+
| example_server    |           1159544 |        1550382945 | 1550382985 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383195 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383255 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383316 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383376 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383435 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383496 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383555 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383616 |               1159580 |
| example_server    |           1159544 |        1550382945 | 1550383676 |               1159580 |
+-------------------+-------------------+-------------------+------------+-----------------------+
10 rows in set (16.11 sec)

Just retrieving those 10 rows took 16 sec, and in production, this query will have to retrieve over 150 rows. The inner query is doing a complete scan of all 3 million+ table rows, with the message "Range checked for each record (index map: 0x2)":

mysql> explain
    -> select name_servers_outside.server_name,
    -> name_servers_outside.file_size,
    -> name_servers_outside.file_date,
    -> name_servers_outside.time,
    -> (select max(file_size) from name_servers where time > (name_servers_outside.time - 172800) and server_name = 'example_server') as max_file_size
    -> from name_servers as name_servers_outside
    -> where name_servers_outside.server_name = 'example_server'
    -> and name_servers_outside.time > (UNIX_TIMESTAMP() - 172800);
+----+--------------------+----------------------+-------+--------------------------+--------------------------+---------+------+---------+------------------------------------------------+
| id | select_type        | table                | type  | possible_keys            | key                      | key_len | ref  | rows    | Extra                                          |
+----+--------------------+----------------------+-------+--------------------------+--------------------------+---------+------+---------+------------------------------------------------+
|  1 | PRIMARY            | name_servers_outside | range | index_time_servername    | index_time_servername    | 5       | NULL |   47302 | Using index condition; Using MRR               |
|  2 | DEPENDENT SUBQUERY | name_servers         | ALL   | index_time_servername    | NULL                     | NULL    | NULL | 3533883 | Range checked for each record (index map: 0x2) |
+----+--------------------+----------------------+-------+--------------------------+--------------------------+---------+------+---------+------------------------------------------------+
2 rows in set (0.01 sec)

The problematic part seems to be this:

time > (name_servers_outside.time - 172800)

If I run a similar query using a static integer value instead of the "name_servers_outside.time" column reference in the subquery, the indexes are used as expected and the query is fast:

time > (UNIX_TIMESTAMP() - 172800)

The modified query:

mysql> select name_servers_outside.server_name,
    -> name_servers_outside.file_size,
    -> name_servers_outside.file_date,
    -> name_servers_outside.time,
    -> (select max(file_size) from name_servers where time > (UNIX_TIMESTAMP() - 172800) and server_name = 'example_server') as max_file_size
    -> from name_servers as name_servers_outside
    -> where name_servers_outside.server_name = 'example_server'
    -> and name_servers_outside.time > (UNIX_TIMESTAMP() - 172800)
    -> limit 10;
+--------------------+-------------------+-------------------+------------+-----------------------+
| server_name        | file_size         | file_date         | time       | max_file_size         |
+--------------------+-------------------+-------------------+------------+-----------------------+
| example_server     |           1159544 |        1550382945 | 1550382985 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383195 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383255 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383316 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383376 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383435 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383496 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383555 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383616 |               1159580 |
| example_server     |           1159544 |        1550382945 | 1550383676 |               1159580 |
+--------------------+-------------------+-------------------+------------+-----------------------+
10 rows in set (0.01 sec)


mysql> explain
    -> select name_servers_outside.server_name,
    -> name_servers_outside.file_size,
    -> name_servers_outside.file_date,
    -> name_servers_outside.time,
    -> (select max(file_size) from name_servers where time > (UNIX_TIMESTAMP() - 172800) and server_name = 'example_server') as max_file_size
    -> from name_servers as name_servers_outside
    -> where name_servers_outside.server_name = 'example_server'
    -> and name_servers_outside.time > (UNIX_TIMESTAMP() - 172800)
    -> limit 10;
+----+-------------+----------------------+-------+--------------------------+--------------------------+---------+------+-------+----------------------------------+
| id | select_type | table                | type  | possible_keys            | key                      | key_len | ref  | rows  | Extra                            |
+----+-------------+----------------------+-------+--------------------------+--------------------------+---------+------+-------+----------------------------------+
|  1 | PRIMARY     | name_servers_outside | range | index_time_servername    | index_time_servername    | 5       | NULL | 49042 | Using index condition; Using MRR |
|  2 | SUBQUERY    | name_servers         | range | index_time_servername    | index_time_servername    | 5       | NULL | 49042 | Using index condition; Using MRR |
+----+-------------+----------------------+-------+--------------------------+--------------------------+---------+------+-------+----------------------------------+
2 rows in set (0.00 sec)

Thank you for reading with me this far. I apologize again for the giant wall of text, but I wanted to make sure I included enough explanatory detail to clearly define the problem.

Now, the problem I'm trying to solve is that I need to retrieve the largest value of file_size in the 48 hours preceding each row. Each row therefore has its own unique time range for the "max(file_size)" calculation. This will then be used to calculate the percent file size change. As mentioned above, I normally want to use window functions for this, but they are not supported by my version of MySQL (5.6.37), and it's not within my ability to update to 8.0 as I don't own this server.

As always, any suggestions are appreciated. Thank you for reading!

like image 371
kagerousan Avatar asked Nov 07 '22 20:11

kagerousan


1 Answers

I would first try adding file_size to your index_time_servername index but I suspect the real issue is that you're having to use name_servers_outside.time inside your subquery which being from a different alias is probably confusing the query planner.

So, how about losing the subquery and join the table to itself where time is between time and time-48-hours-ago?

Something like...

SELECT
  name_servers_outside.server_name,
  name_servers_outside.file_size,
  name_servers_outside.file_date,
  name_servers_outside.time,
  MAX(previous.file_size) AS max_file_size
FROM
   name_servers AS ns
INNER JOIN name_servers AS previous 
   ON previous.time BETWEEN (ns.time - 172800) AND (ns.time - 1)
WHERE 
   ns.server_name = 'example_server'
   AND ns.time > (UNIX_TIMESTAMP() - 172800)
GROUP BY
   ns.server_name,
   ns.file_size,
   ns.file_date,
   ns.time
LIMIT 10;
like image 61
Dazz Knowles Avatar answered Nov 14 '22 23:11

Dazz Knowles