Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL: Optimal index for between queries

Tags:

indexing

mysql

I have a table with the following structure:

CREATE TABLE `geo_ip` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `start_ip` int(10) unsigned NOT NULL,
  `end_ip` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `start_ip` (`start_ip`),
  KEY `end_ip` (`end_ip`),
  KEY `start_end` (`start_ip`,`end_ip`),
  KEY `end_start` (`end_ip`,`start_ip`)) ENGINE=InnoDB;

MySQL seems to be unable to use the indexes for most of my queries, as the where clause uses a between that falls somewhere between start_ip and end_ip:

select * from geo_ip where 2393196360 between start_ip and end_ip;

+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+
| id | select_type | table  | type | possible_keys                       | key  | key_len | ref  | rows    | Extra       |
+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | geo_ip | ALL  | start_ip,end_ip,start_end,end_start | NULL | NULL    | NULL | 2291578 | Using where |
+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+

The table has a few million records. I tried expanding the table, by removing the start_ip and end_ip columns, and creating a row for every possible value of start_ip and end_ip as the id, then querying the id. While that vastly improved query performance, it resulted in the table size growing from less than a gigabyte to tens of gigabytes (the table has other columns obviously).

What else can be done to improve query performance? Can I change the query somehow, or can I index the columns differently to result in a hit? Or perhaps something I haven't thought of yet?

Edit:

Strangely, the index is used for certain values. For example:

explain select * from geo_ip where 3673747503 between start_ip and end_ip;
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+
| id | select_type | table  | type  | possible_keys                       | key    | key_len | ref  | rows  | Extra       |
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+
|  1 | SIMPLE      | geo_ip | range | start_ip,end_ip,start_end,end_start | end_ip | 4       | NULL | 19134 | Using where |
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+
like image 605
Jeshurun Avatar asked Mar 02 '14 08:03

Jeshurun


People also ask

Which columns should be indexed in MySQL?

column the first in the composite index (in my case that would be the Showtime column). The only problem with that is that the index can only be used by the database if the first column is included in the search query, which it currently isn't in either of my queries.

How does MySQL choose which index to use?

If there is a choice between multiple indexes, MySQL normally uses the index that finds the smallest number of rows (the most selective index). If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows.

Does index improve query performance?

Indexes in Oracle and other databases are objects that store references to data in other tables. They are used to improve the query performance, most often the SELECT statement. They aren't a “silver bullet” – they don't always solve performance problems with SELECT statements. However, they can certainly help.


5 Answers

Adding indices will help.

Note: If your query is sth like

where x between a and b AND y between c and d

, a INDEX(x, y) will not improve the performance, but two seperate indices for x and y will.

like image 111
phil294 Avatar answered Oct 28 '22 15:10

phil294


I've just run into the same problem. Since nobody answered the "WHY", and I figured it out, I'll write here an explanation for all future readers.

First, let's dissect the query.

where 2393196360 between start_ip and end_ip

really means

where start_ip <= C and end_ip >= C

so the engine will first use the index on start_ip, end_ip to fetch all rows for which start_ip is smaller than C, and then further filter out the rows for which end_ip is also bigger than C.

When the engine looks for start_ip <= C, and C is a value big enough such that most, or all start_ips are smaller than C, this "first pass" will result in a lot of rows. It will happen every time C is an IP on the higher end of the IP range.

Now, here's the main thing to realise: our dataset is made in such a way that for each start_ip, there is only an end_ip value, and this end_ip value is guaranteed to be lower than the next record's start_ip value. We are partitioning a range and the partitions do not overlap. But, in the general case, when it comes to two table fields, this does not have to be the case!

So, after the 'first pass', the engine will have to look through ALL records that match start_ip <= C to make sure that they also match end_ip >= C, despite the index. Having end_ip as part of the compound index does not do much in our case; it would help only if we had multiple values for end_ip for each value start_ip, but we only have 1. To give you an example, pretend that the columns were populated with the following data:

start_ip  end_ip
1         10001
1         10002
1         10003
------------
2         10001
2         10002
2         10003
------------
...
------------
9999      10001
9999      10002
9999      10003

if you ran a query with start_ip <= 10000 AND end_ip >= 10000, notice that ALL rows match the expression. On the other hand, in our case, with our ip-ranges dataset, we have the guarantee that only ONE record will match any start_ip <= C AND end_ip >= C expression, thanks to the way the ip data is structured. Specifically the record with the biggest value for start_ip, among all those that match start_ip <= C. That's why adding ORDER BY and LIMIT 1 works in this case, and is the cleanest solution, in my opinion.


Edit: I've just noticed that adding the ORDER BY start_ip DESC and LIMIT clauses may not be enough in some cases. If you run the query with a value that is not covered by any ranges in your data, for instance with private IPs like 127.0.0.1 or 192.168.*, the engine will still look at all records that match the start_ip <= C expression, and the query will be slow. That's because since no records matches the the second part of the expression (end_ip >= C), the LIMIT 1 clause never kicks in.

The solution I've found is to construct the query with a join so as to force the engine to first grab the record with the biggest value for start_ip where start_ip <= C, and only then check if end_ip is also >= C. Like this:

SELECT * 
FROM 
  ( select id FROM geo_ip WHERE start_ip <= C ORDER BY start_ip DESC LIMIT 1 ) limit_ip
  INNER JOIN geo_ip ON limit_ip.id = geo_ip.id
WHERE geo_ip.end_ip >= C

This query will perform a single lookup, whether or not the specific ip C is covered by the ranges in the table, and it only requires a single index on start_ip (as well as id as the primary key).

like image 41
Daniel Avatar answered Oct 28 '22 16:10

Daniel


Not sure why, but adding an order by clause and limit to the query seems to always result in an index hit, and executes in a few milliseconds instead of a few seconds.

explain select * from geo_ip where 2393196360 between start_ip and end_ip order by start_ip desc limit 1;
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
| id | select_type | table  | type  | possible_keys   | key      | key_len | ref  | rows   | Extra       |
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
|  1 | SIMPLE      | geo_ip | range | start_ip,end_ip | start_ip | 4       | NULL | 975222 | Using where |
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+

This is good enough for me now, although I would love to know the reasoning behind why the optimizer decides not to use the index in the other case.

like image 26
Jeshurun Avatar answered Oct 28 '22 16:10

Jeshurun


Best index for BETWEEN queries are B-TREE indices. See MySQL docs on that Topic.

ALTER TABLE myTable ADD INDEX myIdx USING BTREE (myCol)
like image 25
Benvorth Avatar answered Oct 28 '22 15:10

Benvorth


If you create an index for start_ip and one for end_ip, I found I could get comparable results to Jeshurun's results without doing the order by, using an inner join with the same table:

select a.* from geo_ip a inner join geo_ip b on a.id=b.id where 2393196360 >= a.start_ip and 2393196360 <= b.end_ip limit 1;

Also you will find MySQL uses a partial index instead of reporting a full-index scan which is more comforting to me.

like image 37
SeanO Avatar answered Oct 28 '22 16:10

SeanO