I have a table with the following structure: <pre class="prettyprint"><code>CREATE TABLE `geo_ip` ( `id` bigint(20) NOT NULL AUTO_INCREMENT, `start_ip` int(10) unsigned NOT NULL, `end_ip` int(10) unsigned NOT NULL, PRIMARY KEY (`id`), KEY `start_ip` (`start_ip`), KEY `end_ip` (`end_ip`), KEY `start_end` (`start_ip`,`end_ip`), KEY `end_start` (`end_ip`,`start_ip`)) ENGINE=InnoDB; </code></pre> MySQL seems to be unable to use the indexes for most of my queries, as the <code>where</code> clause uses a <code>between</code> that falls somewhere between <code>start_ip</code> and <code>end_ip</code>: <pre class="prettyprint"><code>select * from geo_ip where 2393196360 between start_ip and end_ip; +----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+ | 1 | SIMPLE | geo_ip | ALL | start_ip,end_ip,start_end,end_start | NULL | NULL | NULL | 2291578 | Using where | +----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+ </code></pre> The table has a few million records. I tried expanding the table, by removing the <code>start_ip</code> and <code>end_ip</code> columns, and creating a row for every possible value of <code>start_ip</code> and <code>end_ip</code> as the <code>id</code>, then querying the <code>id</code>. While that vastly improved query performance, it resulted in the table size growing from less than a gigabyte to tens of gigabytes (the table has other columns obviously). What else can be done to improve query performance? Can I change the query somehow, or can I index the columns differently to result in a hit? Or perhaps something I haven't thought of yet? Edit: Strangely, the index is used for certain values. For example: <pre class="prettyprint"><code>explain select * from geo_ip where 3673747503 between start_ip and end_ip; +----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+ | 1 | SIMPLE | geo_ip | range | start_ip,end_ip,start_end,end_start | end_ip | 4 | NULL | 19134 | Using where | +----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+ </code></pre>

Adding indices will help. Note: If your query is sth like <pre class="prettyprint"><code>where x between a and b AND y between c and d </code></pre> , a <code>INDEX(x, y)</code> will not improve the performance, but two seperate indices for <code>x</code> and <code>y</code> will.

Best index for BETWEEN queries are B-TREE indices. See MySQL docs on that Topic. <pre class="prettyprint"><code>ALTER TABLE myTable ADD INDEX myIdx USING BTREE (myCol) </code></pre>

If you create an index for start_ip and one for end_ip, I found I could get comparable results to Jeshurun's results without doing the order by, using an inner join with the same table: <pre class="prettyprint"><code>select a.* from geo_ip a inner join geo_ip b on a.id=b.id where 2393196360 >= a.start_ip and 2393196360 <= b.end_ip limit 1; </code></pre> Also you will find MySQL uses a partial index instead of reporting a full-index scan which is more comforting to me.

MySQL: Optimal index for between queries

Tags:

indexing

mysql

I have a table with the following structure:

Click to copy

CREATE TABLE `geo_ip` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `start_ip` int(10) unsigned NOT NULL,
  `end_ip` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `start_ip` (`start_ip`),
  KEY `end_ip` (`end_ip`),
  KEY `start_end` (`start_ip`,`end_ip`),
  KEY `end_start` (`end_ip`,`start_ip`)) ENGINE=InnoDB;

MySQL seems to be unable to use the indexes for most of my queries, as the where clause uses a between that falls somewhere between start_ip and end_ip:

Click to copy

select * from geo_ip where 2393196360 between start_ip and end_ip;

+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+
| id | select_type | table  | type | possible_keys                       | key  | key_len | ref  | rows    | Extra       |
+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | geo_ip | ALL  | start_ip,end_ip,start_end,end_start | NULL | NULL    | NULL | 2291578 | Using where |
+----+-------------+--------+------+-------------------------------------+------+---------+------+---------+-------------+

The table has a few million records. I tried expanding the table, by removing the start_ip and end_ip columns, and creating a row for every possible value of start_ip and end_ip as the id, then querying the id. While that vastly improved query performance, it resulted in the table size growing from less than a gigabyte to tens of gigabytes (the table has other columns obviously).

What else can be done to improve query performance? Can I change the query somehow, or can I index the columns differently to result in a hit? Or perhaps something I haven't thought of yet?

Edit:

Strangely, the index is used for certain values. For example:

Click to copy

explain select * from geo_ip where 3673747503 between start_ip and end_ip;
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+
| id | select_type | table  | type  | possible_keys                       | key    | key_len | ref  | rows  | Extra       |
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+
|  1 | SIMPLE      | geo_ip | range | start_ip,end_ip,start_end,end_start | end_ip | 4       | NULL | 19134 | Using where |
+----+-------------+--------+-------+-------------------------------------+--------+---------+------+-------+-------------+

605

asked Mar 02 '14 08:03

Jeshurun

5 Answers

Adding indices will help.

Note: If your query is sth like

Click to copy

where x between a and b AND y between c and d

, a INDEX(x, y) will not improve the performance, but two seperate indices for x and y will.

111

answered Oct 28 '22 15:10

phil294

I've just run into the same problem. Since nobody answered the "WHY", and I figured it out, I'll write here an explanation for all future readers.

First, let's dissect the query.

Click to copy

where 2393196360 between start_ip and end_ip

really means

Click to copy

where start_ip <= C and end_ip >= C

so the engine will first use the index on start_ip, end_ip to fetch all rows for which start_ip is smaller than C, and then further filter out the rows for which end_ip is also bigger than C.

When the engine looks for start_ip <= C, and C is a value big enough such that most, or all start_ips are smaller than C, this "first pass" will result in a lot of rows. It will happen every time C is an IP on the higher end of the IP range.

Now, here's the main thing to realise: our dataset is made in such a way that for each start_ip, there is only an end_ip value, and this end_ip value is guaranteed to be lower than the next record's start_ip value. We are partitioning a range and the partitions do not overlap. But, in the general case, when it comes to two table fields, this does not have to be the case!

So, after the 'first pass', the engine will have to look through ALL records that match start_ip <= C to make sure that they also match end_ip >= C, despite the index. Having end_ip as part of the compound index does not do much in our case; it would help only if we had multiple values for end_ip for each value start_ip, but we only have 1. To give you an example, pretend that the columns were populated with the following data:

Click to copy

start_ip  end_ip
1         10001
1         10002
1         10003
------------
2         10001
2         10002
2         10003
------------
...
------------
9999      10001
9999      10002
9999      10003

if you ran a query with start_ip <= 10000 AND end_ip >= 10000, notice that ALL rows match the expression. On the other hand, in our case, with our ip-ranges dataset, we have the guarantee that only ONE record will match any start_ip <= C AND end_ip >= C expression, thanks to the way the ip data is structured. Specifically the record with the biggest value for start_ip, among all those that match start_ip <= C. That's why adding ORDER BY and LIMIT 1 works in this case, and is the cleanest solution, in my opinion.

Edit: I've just noticed that adding the ORDER BY start_ip DESC and LIMIT clauses may not be enough in some cases. If you run the query with a value that is not covered by any ranges in your data, for instance with private IPs like 127.0.0.1 or 192.168.*, the engine will still look at all records that match the start_ip <= C expression, and the query will be slow. That's because since no records matches the the second part of the expression (end_ip >= C), the LIMIT 1 clause never kicks in.

The solution I've found is to construct the query with a join so as to force the engine to first grab the record with the biggest value for start_ip where start_ip <= C, and only then check if end_ip is also >= C. Like this:

Click to copy

SELECT * 
FROM 
  ( select id FROM geo_ip WHERE start_ip <= C ORDER BY start_ip DESC LIMIT 1 ) limit_ip
  INNER JOIN geo_ip ON limit_ip.id = geo_ip.id
WHERE geo_ip.end_ip >= C

This query will perform a single lookup, whether or not the specific ip C is covered by the ranges in the table, and it only requires a single index on start_ip (as well as id as the primary key).

answered Oct 28 '22 16:10

Daniel

Not sure why, but adding an order by clause and limit to the query seems to always result in an index hit, and executes in a few milliseconds instead of a few seconds.

Click to copy

explain select * from geo_ip where 2393196360 between start_ip and end_ip order by start_ip desc limit 1;
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
| id | select_type | table  | type  | possible_keys   | key      | key_len | ref  | rows   | Extra       |
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+
|  1 | SIMPLE      | geo_ip | range | start_ip,end_ip | start_ip | 4       | NULL | 975222 | Using where |
+----+-------------+--------+-------+-----------------+----------+---------+------+--------+-------------+

This is good enough for me now, although I would love to know the reasoning behind why the optimizer decides not to use the index in the other case.

answered Oct 28 '22 16:10

Jeshurun

Best index for BETWEEN queries are B-TREE indices. See MySQL docs on that Topic.

Click to copy

ALTER TABLE myTable ADD INDEX myIdx USING BTREE (myCol)

answered Oct 28 '22 15:10

Benvorth

If you create an index for start_ip and one for end_ip, I found I could get comparable results to Jeshurun's results without doing the order by, using an inner join with the same table:

Click to copy

select a.* from geo_ip a inner join geo_ip b on a.id=b.id where 2393196360 >= a.start_ip and 2393196360 <= b.end_ip limit 1;

Also you will find MySQL uses a partial index instead of reporting a full-index scan which is more comforting to me.

answered Oct 28 '22 16:10

SeanO

Related questions
                            
                                Remove duplicate rows leaving oldest row Only?
                            
                                Implementing the Hacker News ranking algorithm in SQL
                            
                                Mysql Slow Insert
                            
                                Installing MySQL on Ubuntu with ruby on rails [ERROR]
                            
                                How do I count the number of deleted rows in a MYSQL query?
                            
                                MySQL Replication fails with error "Could not parse relay log event entry."
                            
                                How to select non-empty columns in MySQL?
                            
                                Convert HEX column to DEC in MySQL
                            
                                How to add the JDBC mysql driver to an Eclipse project? [duplicate]
                            
                                Restricting a column to accept only 2 values
                            
                                Select last record in the table
                            
                                'Delete' user but keep records (foreign keys)
                            
                                which jdbc (connector/j) to install for mac os
                            
                                How do I format numbers to have only two decimal places?
                            
                                What is Locking in MySQL and when would you use it? [closed]
                            
                                How do I store a credit card's expiration date if it only consists of the month and year?
                            
                                Is “Where IN” with multiple columns defined in Standard SQL?
                            
                                Automatically importing products into Woocommerce
                            
                                SQL query to extract all WordPress posts with categories
                            
                                Convert SQL results into PHP array

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

MySQL: Optimal index for between queries

Tags:

indexing

mysql

Jeshurun

People also ask

5 Answers

phil294

Daniel

Jeshurun

Benvorth

SeanO

Recent Activity

Donate For Us