Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize SQL that uses between clause

Consider the following 2 tables:

Table A:
id
event_time

Table B
id
start_time
end_time

Every record in table A is mapped to exactly 1 record in table B. This means table B has no overlapping periods. Many records from table A can be mapped to the same record in table B.

I need a query that returns all A.id, B.id pairs. Something like:

SELECT A.id, B.id 
FROM A, B 
WHERE A.event_time BETWEEN B.start_time AND B.end_time

I am using MySQL and I cannot optimize this query. With ~980 records in table A and 130.000 in table B this takes forever. I understand this has to perform 980 queries, but taking more than 15 minutes on a beefy machine is strange. Any suggestions?

P.S. I cannot change the database schema, but I can add indexes. However an index (with 1 or 2 fields) on the time fields doesn't help.

like image 532
daremon Avatar asked Feb 17 '09 15:02

daremon


3 Answers

You may want to try something like this

Select A.ID,
(SELECT B.ID FROM B
WHERE A.EventTime BETWEEN B.start_time AND B.end_time LIMIT 1) AS B_ID
FROM A

If you have an index on the Start_Time,End_Time fields for B, then this should work quite well.

like image 53
Kibbee Avatar answered Oct 12 '22 00:10

Kibbee


I'm not sure this can be optimized fully. I tried it on MySQL 5.1.30. I also added an index on {B.start_time, B.end_time} as suggested by other folks. Then I got a report from EXPLAIN, but the best I could get is a Range Access Method:

EXPLAIN SELECT A.id, B.id FROM A JOIN B 
ON A.event_time BETWEEN B.start_time AND B.end_time;

+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                                          |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+
|  1 | SIMPLE      | A     | ALL  | event_time    | NULL | NULL    | NULL |    8 |                                                | 
|  1 | SIMPLE      | B     | ALL  | start_time    | NULL | NULL    | NULL |   96 | Range checked for each record (index map: 0x4) | 
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------------------------+

See the note on the far right. The optimizer thinks it might be able to use the index on {B.start_time, B.end_time} but it ended up deciding not to use that index. Your results may vary, because your data distribution is more representative.

Compare with the index usage if you compare A.event_time to a constant range:

EXPLAIN SELECT A.id FROM A
WHERE A.event_time BETWEEN '2009-02-17 09:00' and '2009-02-17 10:00';

+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
|  1 | SIMPLE      | A     | range | event_time    | event_time | 8       | NULL |    1 | Using where | 
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+

And compare with the dependent sub-query form given by @Luke and @Kibbee, which seems to make use of indexes more effectively:

EXPLAIN SELECT A.id AS id_from_a,
    (
        SELECT B.id
        FROM B
        WHERE A.id BETWEEN B.start_time AND B.end_time
        LIMIT 0, 1
    ) AS id_from_b
FROM A;

+----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type        | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | PRIMARY            | A     | index | NULL          | PRIMARY | 8       | NULL |    8 | Using index | 
|  2 | DEPENDENT SUBQUERY | B     | ALL   | start_time    | NULL    | NULL    | NULL |  384 | Using where | 
+----+--------------------+-------+-------+---------------+---------+---------+------+------+-------------+

Weirdly, EXPLAIN lists possible_keys as NULL (i.e. no indexes could be used) but then decides to use the primary key after all. Could be an idiosyncrasy of MySQL's EXPLAIN report?

like image 34
Bill Karwin Avatar answered Oct 11 '22 23:10

Bill Karwin


I wouldn't normally recommend a query like this, but...

Since you've specified that table A only has about 980 rows and that each row maps to exactly one row in table B, then you could do the following and it will most likely be a lot faster than a cartesian join:

SELECT A.id AS id_from_a,
    (
        SELECT B.id
        FROM B
        WHERE A.event_time BETWEEN B.start_time AND B.end_time
        LIMIT 0, 1
    ) AS id_from_b
FROM A
like image 41
LukeH Avatar answered Oct 11 '22 22:10

LukeH