Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimal Performance for Joining on Range of Values

I have a really big table that contains integer representations of IP addresses and a second table that has starting and ending ranges of integers representations of IP addresses. The second table is used to return the country as per several stackoverflow articles. Although this returns the required results, the performance is fairly poor. Is there any higher performing alternative to joining on a range? Below is a sample set of code that shows how the join works currently:

CREATE TABLE #BaseTable
    ( SomeIntegerValue INT PRIMARY KEY);

INSERT INTO #BaseTable (SomeIntegerValue)
SELECT SomeIntegerValue
FROM (VALUES
    (123), (456), (789)) Data (SomeIntegerValue);

CREATE TABLE #RangeLookupTable
    ( RangeStartValue INT PRIMARY KEY
    , RangeEndValue INT NOT NULL);

INSERT INTO #RangeLookupTable (RangeStartValue, RangeEndValue)
SELECT RangeStartValue, RangeEndValue
FROM (VALUES
      (0, 100), (101, 200), (201, 300)
    , (301, 400), (401, 500), (501, 600)
    , (701, 800), (901, 1000)) Data (RangeStartValue, RangeEndValue);

SELECT *
FROM #BaseTable bt
JOIN #RangeLookupTable rlt
    ON bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue
like image 661
Registered User Avatar asked Mar 22 '13 21:03

Registered User


People also ask

What is Range join optimization?

A range join occurs when two relations are joined using a point in interval or interval overlap condition. The range join optimization support in Databricks Runtime can bring orders of magnitude improvement in query performance, but requires careful manual tuning.

Does the order of joins matter for performance?

Basically, join order DOES matter because if we can join two tables that will reduce the number of rows needed to be processed by subsequent steps, then our performance will improve.

Which join is most efficient in SQL?

Relational algebra is the most common way of writing a query and also the most natural way to do so. The code is clean, easy to troubleshoot, and unsurprisingly, it is also the most efficient way to join two tables.


1 Answers

If the specific situation permits holding de-normalized data in a table, and then querying from that table rather than the normalized base table, a very fast retrieval time could be achieved. The query Execution Plan shows 2x gain in the SELECT, even with this sample data of 3 rows.

Such an approach would be possible in a scenario having relatively fewer writes and more read operations. The JOIN will need to be executed only when updating the data; testing with actual data will show how much (or whether any at all!) improvement is actually achieved in the overall (UPDATE + SELECT) system picture.

Sample code, along with the resulting Execution Plan screenshots for the SELECT statements, is given below.

CREATE TABLE #BaseTable
    ( SomeIntegerValue INT PRIMARY KEY);

INSERT INTO #BaseTable (SomeIntegerValue)
SELECT SomeIntegerValue
FROM (VALUES
    (123), (456), (789)) Data (SomeIntegerValue);

CREATE TABLE #RangeLookupTable
    ( RangeStartValue INT PRIMARY KEY
    , RangeEndValue INT NOT NULL);

INSERT INTO #RangeLookupTable (RangeStartValue, RangeEndValue)
SELECT RangeStartValue, RangeEndValue
FROM (VALUES
      (0, 100), (101, 200), (201, 300)
    , (301, 400), (401, 500), (501, 600)
    , (701, 800), (901, 1000)) Data (RangeStartValue, RangeEndValue);

-- Alternative approach: Denormalized base table
CREATE TABLE #BaseTable2
    ( SomeIntegerValue INT PRIMARY KEY,
      RangeStartValue INT null,
      RangeEndValue INT NULL);

INSERT INTO #BaseTable2 (SomeIntegerValue)
SELECT SomeIntegerValue
FROM (VALUES
    (123), (456), (789)) Data (SomeIntegerValue);

UPDATE #BaseTable2
SET RangeStartValue = rlt.RangeStartValue,
    RangeEndValue = rlt.RangeEndValue
FROM #BaseTable2 bt2
JOIN #RangeLookupTable rlt
    ON bt2.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue

-- The original: SELECT with a JOIN
SELECT *
FROM #BaseTable bt
JOIN #RangeLookupTable rlt
    ON bt.SomeIntegerValue BETWEEN rlt.RangeStartValue AND rlt.RangeEndValue

-- The alternative: SELECT from the denormalized base table
SELECT * from #BaseTable2;

GO

Query execution plans for the JOINed vs. denormalized SELECTs:

Query Execution with a JOIN vs. a denormalized table

like image 123
Krishna Gupta Avatar answered Sep 30 '22 20:09

Krishna Gupta