At work we had a query on a table that had the following structure:
ip_from(number), ip_to(number), country, city, state, isp, latitude, longitude.
This table had approx 6.1 million rows.
To find out the details for a given IP address we used a query like the following:
SELECT *
FROM Ip2location
WHERE
:ip_num BETWEEN ip_from AND ip_to;
On Oracle 10 in our dev database this took approximately 17 seconds to return a row, depending on the ip_num
passed in. On our beefier live system it took maybe 5-6 seconds, which was still too slow to do in real time and we needed to select this via a background job.
Not ideal, especially as our real time systems really needed the ip details.
The type of index used was a standard BTREE index spanning both ip_from
and ip_to
. We looked into a lot of things to try to speed this up such as range partitioning. We didn't apply that in the end as it requires Oracle Enterprise. We also looked at increasing the concurrency of the table but that had no noticeable effect.
Anyway when having my morning coffee I realised that I thought there could be a performance enhancement by running the following query: (This is from memory, there may be a couple of mistakes. Also we selected individual fields not everything)
SELECT *
FROM ip2location
WHERE
ip_from = (
SELECT max(ip_from)
FROM ip2location
WHERE ip_from <= :ip_num
)
AND
ip_to >= ip_num;
This works for our data set because there's no overlapping ranges between ip_from
and ip_to
.
However what I wasn't prepared for is how much faster the second query is. The time on our dev database was reduced from 17 seconds to 0.007 seconds.
This makes little sense to me. I'd expect some performance increase, but not that much. Shouldn't the database statistics have figured out there is no overlap and optimised accordingly? Also there has to be a recognised quicker way to select using ranges?
My question is: why is the second query so much faster even using a sub-select?
the performance increase is obvious. Its because there is an index on ip_from , so max(ip_from) can be obtained in constant time because as you know indexing sorts out the values. the range is also easily computed because of binary search over the btree.
while in the previous query has to do a table scan all over the data to compute the range bounds
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With