Joining 2 large postgres tables using int8range not scaling well

I'd like to join IP routing table information to IP whois information. I'm using Amazon's RDS, which means I can't use the Postgres ip4r extension, so I am instead using int8range types to represent the IP address ranges, with gist indexes.
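
For illustration, one way a netblock can be mapped to an int8range (a sketch; not necessarily the exact conversion used for these tables) is by subtracting inet values to get int8 bounds:

SELECT int8range(
    ('192.168.0.0'::inet - '0.0.0.0'::inet)       -- lower bound as int8
    ,('192.168.255.255'::inet - '0.0.0.0'::inet)  -- upper bound as int8
    ,'[]'                                         -- inclusive on both ends
) AS range;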

My tables look like this:

=> \d routing_details
     Table "public.routing_details"
  Column  |   Type    | Modifiers
----------+-----------+-----------
 asn      | text      |
 netblock | text      |
 range    | int8range |
Indexes:
    "idx_routing_details_netblock" btree (netblock)
    "idx_routing_details_range" gist (range)


=> \d netblock_details
    Table "public.netblock_details"
   Column   |   Type    | Modifiers
------------+-----------+-----------
 range      | int8range |
 name       | text      |
 country    | text      |
 source     | text      |
Indexes:
    "idx_netblock_details_range" gist (range)

The full routing_details table contains just under 600K rows, and netblock_details contains around 8.25M rows. There are overlapping ranges in both tables, but for each range in the routing_details table I want to get the single best (smallest) match from the netblock_details table.

I came up with 2 different queries that I think will return accurate data, one using window functions and one using DISTINCT ON:

EXPLAIN VERBOSE SELECT DISTINCT ON (r.netblock) *
FROM routing_details r JOIN netblock_details n ON r.range <@ n.range
ORDER BY r.netblock, upper(n.range) - lower(n.range);
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=118452809778.47..118477166326.22 rows=581300 width=91)
   Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, ((upper(n.range) - lower(n.range)))
   ->  Sort  (cost=118452809778.47..118464988052.34 rows=4871309551 width=91)
         Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, ((upper(n.range) - lower(n.range)))
         Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
         ->  Nested Loop  (cost=0.00..115920727265.53 rows=4871309551 width=91)
               Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, (upper(n.range) - lower(n.range))
               Join Filter: (r.range <@ n.range)
               ->  Seq Scan on public.routing_details r  (cost=0.00..11458.96 rows=592496 width=43)
                     Output: r.asn, r.netblock, r.range
               ->  Materialize  (cost=0.00..277082.12 rows=8221675 width=48)
                     Output: n.range, n.name, n.country
                     ->  Seq Scan on public.netblock_details n  (cost=0.00..163712.75 rows=8221675 width=48)
                           Output: n.range, n.name, n.country
(14 rows)


EXPLAIN VERBOSE SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY r.range ORDER BY UPPER(n.range) - LOWER(n.range)) AS rank
FROM routing_details r JOIN netblock_details n ON r.range <@ n.range
) a WHERE rank = 1 ORDER BY netblock;

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=118620775630.16..118620836521.53 rows=24356548 width=99)
   Output: a.asn, a.netblock, a.range, a.range_1, a.name, a.country, a.rank
   Sort Key: a.netblock
   ->  Subquery Scan on a  (cost=118416274956.83..118611127338.87 rows=24356548 width=99)
         Output: a.asn, a.netblock, a.range, a.range_1, a.name, a.country, a.rank
         Filter: (a.rank = 1)
         ->  WindowAgg  (cost=118416274956.83..118550235969.49 rows=4871309551 width=91)
               Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, row_number() OVER (?), ((upper(n.range) - lower(n.range))), r.range
               ->  Sort  (cost=118416274956.83..118428453230.71 rows=4871309551 width=91)
                     Output: ((upper(n.range) - lower(n.range))), r.range, r.asn, r.netblock, n.range, n.name, n.country
                     Sort Key: r.range, ((upper(n.range) - lower(n.range)))
                     ->  Nested Loop  (cost=0.00..115884192443.90 rows=4871309551 width=91)
                           Output: (upper(n.range) - lower(n.range)), r.range, r.asn, r.netblock, n.range, n.name, n.country
                           Join Filter: (r.range <@ n.range)
                           ->  Seq Scan on public.routing_details r  (cost=0.00..11458.96 rows=592496 width=43)
                                 Output: r.asn, r.netblock, r.range
                           ->  Materialize  (cost=0.00..277082.12 rows=8221675 width=48)
                                 Output: n.range, n.name, n.country
                                 ->  Seq Scan on public.netblock_details n  (cost=0.00..163712.75 rows=8221675 width=48)
                                       Output: n.range, n.name, n.country
(20 rows)

The DISTINCT ON version seems slightly more efficient, so I've proceeded with that one. When I run the query against the full dataset I get an out-of-disk-space error after around a 24-hour wait. I've created a routing_details_small table with a subset of N rows of the full routing_details table to try to understand what's going on.

With N=1000

=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=4411888.68..4453012.20 rows=999 width=90) (actual time=124.094..133.720 rows=999 loops=1)
   ->  Sort  (cost=4411888.68..4432450.44 rows=8224705 width=90) (actual time=124.091..128.560 rows=4172 loops=1)
         Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
         Sort Method: external sort  Disk: 608kB
         ->  Nested Loop  (cost=0.41..1780498.29 rows=8224705 width=90) (actual time=0.080..101.518 rows=4172 loops=1)
               ->  Seq Scan on routing_details_small r  (cost=0.00..20.00 rows=1000 width=42) (actual time=0.007..1.037 rows=1000 loops=1)
               ->  Index Scan using idx_netblock_details_range on netblock_details n  (cost=0.41..1307.55 rows=41124 width=48) (actual time=0.063..0.089 rows=4 loops=1000)
                     Index Cond: (r.range <@ range)
 Total runtime: 134.999 ms
(9 rows)

With N=100000

=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=654922588.98..659034941.48 rows=200 width=144) (actual time=28252.677..29487.380 rows=98992 loops=1)
   ->  Sort  (cost=654922588.98..656978765.23 rows=822470500 width=144) (actual time=28252.673..28926.703 rows=454856 loops=1)
         Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
         Sort Method: external merge  Disk: 64488kB
         ->  Nested Loop  (cost=0.41..119890431.75 rows=822470500 width=144) (actual time=0.079..24951.038 rows=454856 loops=1)
               ->  Seq Scan on routing_details_small r  (cost=0.00..1935.00 rows=100000 width=96) (actual time=0.007..110.457 rows=100000 loops=1)
               ->  Index Scan using idx_netblock_details_range on netblock_details n  (cost=0.41..725.96 rows=41124 width=48) (actual time=0.067..0.235 rows=5 loops=100000)
                     Index Cond: (r.range <@ range)
 Total runtime: 29596.667 ms
(9 rows)

With N=250000

=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
                                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1651822953.55..1662103834.80 rows=200 width=144) (actual time=185835.443..190143.266 rows=247655 loops=1)
   ->  Sort  (cost=1651822953.55..1656963394.18 rows=2056176250 width=144) (actual time=185835.439..188779.279 rows=1103850 loops=1)
         Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
         Sort Method: external merge  Disk: 155288kB
         ->  Nested Loop  (cost=0.28..300651962.46 rows=2056176250 width=144) (actual time=19.325..177403.913 rows=1103850 loops=1)
               ->  Seq Scan on netblock_details n  (cost=0.00..163743.05 rows=8224705 width=48) (actual time=0.007..8160.346 rows=8224705 loops=1)
               ->  Index Scan using idx_routing_details_small_range on routing_details_small r  (cost=0.28..22.16 rows=1250 width=96) (actual time=0.018..0.018 rows=0 loops=8224705)
                     Index Cond: (range <@ n.range)
 Total runtime: 190413.912 ms
(9 rows)

Against the full table with 600k rows the query fails after around 24 hours with an error about running out of disk space, which is presumably caused by the external merge step. So the query works well and very quickly with a small routing_details table, but scales very poorly.

Any suggestions for how to improve my query, or perhaps even schema changes I could make, so that this query will work efficiently on the full dataset?

asked Jul 23 '15 by Ben Dowling


1 Answer

Originally I was thinking of a lateral join, as in other suggested approaches (for example, the last query by Erwin Brandstetter, which uses a plain int8 datatype and simple indexes), but I didn't want to put it in this answer because I didn't think it would be efficient: when you try to find all netblock ranges that cover a given range, an index doesn't help much.

I'll repeat Erwin Brandstetter's query here:

SELECT *  -- only select columns you need to make it faster
FROM   routing_details r
     , LATERAL (
   SELECT *
   FROM   netblock_details n
   WHERE  n.ip_min <= r.ip_min
   AND    n.ip_max >= r.ip_max
   ORDER  BY n.ip_max - n.ip_min
   LIMIT  1
   ) n;

When you have an index on netblock_details, like this:

CREATE INDEX netblock_details_ip_min_max_idx ON netblock_details 
(ip_min, ip_max DESC NULLS LAST);

you can quickly (in O(log N)) find the starting point of the scan in the netblock_details table - either the maximum n.ip_min that is less than r.ip_min, or the minimum n.ip_max that is greater than r.ip_max. But then you have to scan/read the rest of the index/table, and for each row do the second part of the check and filter out most rows.

In other words, the index helps to quickly find the starting row that satisfies the first search criterion, n.ip_min <= r.ip_min, but then you'll keep reading all rows that satisfy that criterion and perform the second check, n.ip_max >= r.ip_max, on each of them. On average (if the data is evenly distributed) you'll have to read half the rows of the netblock_details table. The optimizer may choose to use the index to search on n.ip_max >= r.ip_max first and then apply the second filter n.ip_min <= r.ip_min, but you can't use this index to apply both filters together.

End result: for each row from routing_details we'd read through half the rows of netblock_details. 600K * 4M = 2,400,000,000,000 rows. That is only 2 times better than the Cartesian product. You can see this number (the Cartesian product) in the output of EXPLAIN in the question:

592,496 * 8,221,675 = 4,871,309,550,800

No wonder your queries run out of disk space when trying to materialize and sort this.


The overall high level process to get to the final result:

  • join the two tables (without finding the best match yet). In the worst case it is a Cartesian product; in the best case it is all covering ranges from netblock_details for each range from routing_details. You said that there are multiple entries in netblock_details for each entry in routing_details, anything from 3 to around 10, so the result of this join could have ~6M rows (not too many)

  • order/partition the result of the join by the routing_details ranges and for each such range find the best (smallest) covering range from netblock_details.


My idea is to reverse the query. Instead of finding all covering ranges from the larger netblock_details for each row of the smaller routing_details table, I suggest finding all smaller ranges from the smaller routing_details for each row of the larger netblock_details.

Two step process

For each row of the larger netblock_details, find all ranges from routing_details that lie inside the netblock range.

I would use the following schema in the queries. I've added a primary key ID to both tables.

CREATE TABLE routing_details (
ID        int
,ip_min   int8
,ip_max   int8
,asn      text
,netblock text
);

CREATE TABLE netblock_details (
ID        int
,ip_min   int8
,ip_max   int8
,name     text
,country  text
,source   text
);

SELECT
    netblock_details.ID AS n_ID
    ,netblock_details.ip_max - netblock_details.ip_min AS n_length
    ,r.ID AS r_ID
FROM
    netblock_details
    INNER JOIN LATERAL
    (
        SELECT routing_details.ID
        FROM routing_details
        WHERE
            routing_details.ip_min >= netblock_details.ip_min
            AND routing_details.ip_min <= netblock_details.ip_max
            -- note how routing_details.ip_min is limited from both sides
            -- this would make it possible to scan only (hopefully) small
            -- portion of the table instead of full or half table
            AND routing_details.ip_max <= netblock_details.ip_max
            -- this clause ensures that the whole routing range
            -- is inside the netblock range
    ) AS r ON true

We need an index on routing_details (ip_min, ip_max). The main thing here is the index on ip_min. Having the second column in the index helps by eliminating the need to look up the value of ip_max; it doesn't help in the tree search.
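
A sketch of that index (the name is illustrative):

CREATE INDEX idx_routing_details_ip_min_max
    ON routing_details (ip_min, ip_max);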

Note that the lateral subquery doesn't have a LIMIT. It is not the final result yet. This query should return ~6M rows. Save this result in a temporary table. Add an index to the temporary table on (r_ID, n_length, n_ID). n_ID is again just to remove extra lookups. We need the index to avoid sorting the data for each r_ID.
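
A sketch of materializing this intermediate result with the step-one query above (the temp table and index names are illustrative):

CREATE TEMPORARY TABLE temp AS
SELECT
    netblock_details.ID AS n_ID
    ,netblock_details.ip_max - netblock_details.ip_min AS n_length
    ,r.ID AS r_ID
FROM
    netblock_details
    INNER JOIN LATERAL
    (
        SELECT routing_details.ID
        FROM routing_details
        WHERE
            routing_details.ip_min >= netblock_details.ip_min
            AND routing_details.ip_min <= netblock_details.ip_max
            AND routing_details.ip_max <= netblock_details.ip_max
    ) AS r ON true;

-- this index lets the final step pick the best match per r_ID without sorting
CREATE INDEX idx_temp_r_id_n_length ON temp (r_ID, n_length, n_ID);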

Final step

For each row of routing_details, find the n_ID with the smallest n_length. Now we can use the lateral join in the "proper" order. Here temp is the result of the previous step, with its index.

SELECT
    routing_details.*
    ,t.n_ID
    ,netblock_details.*
FROM
    routing_details
    INNER JOIN LATERAL
    (
        SELECT temp.n_ID
        FROM temp
        WHERE temp.r_ID = routing_details.ID
        ORDER BY temp.n_length
        LIMIT 1
    ) AS t ON true
    INNER JOIN netblock_details ON netblock_details.ID = t.n_ID

Here the subquery should be an index seek, not a scan. The optimizer should be smart enough to do just the seek and return the first result found because of LIMIT 1, so you'll have 600K index seeks against the 6M-row temp table.


Original answer (I'll keep it just for the diagram of ranges):

Can you "cheat"?

If I understood your query correctly, for each routing_details.range you want to find the smallest netblock_details.range that covers (is larger than) routing_details.range.

Given the following example, where r is routing range and n1, ..., n8 are netblock ranges, the correct answer is n5.

   |---|
   n1

     |------------------|
     n2

                           |---------------|
                           n3

                                          |-----|
                                          n4

                  |------------------|
                  n5

                     |--------------------------------------|
                     n6

        |---------------------------|
        n7

                      |-----|
                      n8

                      |------------|
                      r
                     start       end

n.start <= r.start AND n.end >= r.end
order by n.length
limit 1 

Your query that took 24 hours shows that the index scan took 6ms, but sorting by range length took 80ms.

With this kind of search there is no simple one-dimensional ordering of the data: you are using n.start together with n.end and with n.length. But maybe your data is not that generic, or maybe it is OK to return a somewhat different result.

For example, if it were OK to return n6 as the result, the query could work much faster. n6 is the covering range that has the largest start:

n.start <= r.start AND n.end >= r.end
order by n.start desc
limit 1 

Or, you could go for n7, which has the smallest end:

n.start <= r.start AND n.end >= r.end
order by n.end
limit 1 

This kind of search would use a simple index on n.start (or n.end) without any extra sorting.
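
For instance, here is the "largest start" variant written as a full query against the ip_min/ip_max schema above (a sketch; ip_min plays the role of start and ip_max of end):

SELECT r.*, n.*
FROM   routing_details r
     , LATERAL (
   SELECT *
   FROM   netblock_details n
   WHERE  n.ip_min <= r.ip_min
   AND    n.ip_max >= r.ip_max
   ORDER  BY n.ip_min DESC  -- largest start wins, so n6 may be returned instead of n5
   LIMIT  1
   ) n;

With a plain btree index on n.ip_min, the ORDER BY ... LIMIT 1 can stop at the first row that passes the ip_max check instead of sorting all candidates.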


A second, completely different approach: how big/long are the ranges? If they are relatively short (a few numbers), then you could try to store them as an explicit list of integers instead of a range. For example, the range [5-8] would be stored as 4 rows: (5, 6, 7, 8). With this storage model it may be easier to find intersections of ranges.
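
A sketch of that expansion using generate_series (the netblock_addresses table name is assumed; this is practical only while the ranges stay short):

-- expand each netblock range into one row per contained integer
CREATE TABLE netblock_addresses AS
SELECT
    n.ID
    ,g.ip
FROM
    netblock_details n
    ,LATERAL generate_series(n.ip_min, n.ip_max) AS g(ip);

A plain btree index on ip then turns containment checks into simple equality lookups.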

answered Nov 15 '22 by Vladimir Baranov