 

Is it reasonable to stuff 1000 ids into a SELECT ... WHERE ... IN (...) query on Postgres? [duplicate]

Possible Duplicate:
PostgreSQL - max number of parameters in “IN” clause?

I'm developing a web API to perform RESTful queries on a resource that maps nicely to a Postgres table. Most of the filtering parameters also map nicely to parameters on the SQL query. A few of the filtering parameters, however, require a call to my search index (in this case, a Sphinx server).

The simplest thing to do is to run my search, collect the primary keys from the search results, and stuff those into an IN (...) clause on the SQL query. However, since the search could return a lot of primary keys, I wonder if this is such a bright idea.
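
For concreteness, the query I have in mind looks roughly like this (the table and column names here are just illustrative, and the id list would come from Sphinx):

select id, title, body
from documents
where id in (1001, 1002, 1003 /* ...potentially a few thousand more ids... */)
order by updated_at desc;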

I expect that most of the time (say, 90%), my search will be returning results on the order of a few hundred. Perhaps 10% of the time, there will be on the order of several thousand results.

Is this a reasonable approach? Is there a better way?

asked Mar 28 '12 by David Eyk




2 Answers

I strongly favor the experimental approach to answering performance questions. @Catcall made a nice start, but sized his experiment much smaller than many real databases. His 300,000 single-integer rows easily fit in memory, so no IO is occurring; in addition, he didn't share the actual numbers.

I composed a similar experiment, but sized the sample data to be about 7x as large as the available memory on my host (a 7GB dataset on a 1GB, fractional-CPU VM with an NFS-mounted filesystem). There are 30,000,000 rows composed of a single indexed bigint and a random-length string between 0 and 400 bytes.

create table t(id bigint primary key, stuff text);
insert into t(id,stuff) select i, repeat('X',(random()*400)::integer)
from generate_series(0,30000000) i;
analyze t;

What follows are explain analyze runtimes for a select IN of sets of 10, 100, 1,000, 10,000 and 100,000 random integers in the key domain. Each query is in the following form, with $1 replaced by the set count.

explain analyze
select id from t
where id in (
  select (random()*30000000)::integer from generate_series(0,$1)
);

Summary Times

  ct        tot ms     ms/row
  10        84         8.4
  100       1185       11.8
  1,000     12407      12.4
  10,000    109747     11.0
  100,000   1016842    10.1

Note that the plan stays the same for each IN-set cardinality: build a hash aggregate of the random integers, then loop and do a single indexed lookup for each value. The fetch time is near linear with the cardinality of the IN set, in the 8-12 ms/row range. A faster storage system could undoubtedly improve these times dramatically, but the experiment shows that Pg handles very large sets in the IN clause with aplomb -- at least from an execution-speed perspective. Note that if you supply the list via a bind parameter or by literal interpolation into the SQL statement, you will incur additional overhead for network transmission of the query to the server and increased parse time, though I suspect both will be negligible compared to the IO time of executing the query.
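
If the list does come from the application, one way to keep the statement small is to bind the whole set as a single array parameter and use = ANY instead of IN. A minimal sketch using a server-side prepared statement (your driver's placeholder syntax may differ):

-- equivalent to WHERE id IN (...), but the ids travel as one array value
prepare fetch_by_ids (bigint[]) as
  select id, stuff from t where id = any($1);

execute fetch_by_ids (array[17, 42, 1000345]);  -- example id values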

# fetch 10
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=0.110..84.494 rows=11 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=0.046..0.054 rows=11 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.036..0.039 rows=11 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=7.672..7.673 rows=1 loops=11)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 84.580 ms


# fetch 100
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=12.405..1184.758 rows=101 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=0.095..0.210 rows=101 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.046..0.067 rows=101 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=11.723..11.725 rows=1 loops=101)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 1184.843 ms

# fetch 1,000
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=14.403..12406.667 rows=1001 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=0.609..1.689 rows=1001 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.128..0.332 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=12.381..12.390 rows=1 loops=1001)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 12407.059 ms

# fetch 10,000
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=21.884..109743.854 rows=9998 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=5.761..18.090 rows=9998 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=1.004..3.087 rows=10001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=10.968..10.972 rows=1 loops=9998)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 109747.169 ms

# fetch 100,000
 Nested Loop  (cost=30.00..2341.27 rows=15002521 width=8) (actual time=110.244..1016781.944 rows=99816 loops=1)
   ->  HashAggregate  (cost=30.00..32.00 rows=200 width=4) (actual time=110.169..253.947 rows=99816 loops=1)
         ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=51.141..77.482 rows=100001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=10.176..10.181 rows=1 loops=99816)
         Index Cond: (t.id = (((random() * 30000000::double precision))::integer))
 Total runtime: 1016842.772 ms

At @Catcall's request, I ran similar queries using a CTE and a temp table. Both approaches had comparably simple nested loop index scan plans and ran in times comparable to (though slightly slower than) the inline IN queries.

-- CTE
EXPLAIN analyze
with ids as (select (random()*30000000)::integer as val from generate_series(0,1000))
select id from t where id in (select ids.val from ids);

 Nested Loop  (cost=40.00..2351.27 rows=15002521 width=8) (actual time=21.203..12878.329 rows=1001 loops=1)
   CTE ids
     ->  Function Scan on generate_series  (cost=0.00..17.50 rows=1000 width=0) (actual time=0.085..0.306 rows=1001 loops=1)
   ->  HashAggregate  (cost=22.50..24.50 rows=200 width=4) (actual time=0.771..1.907 rows=1001 loops=1)
         ->  CTE Scan on ids  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.087..0.552 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=12.859..12.861 rows=1 loops=1001)
         Index Cond: (t.id = ids.val)
 Total runtime: 12878.812 ms
(8 rows)

-- Temp table
create table temp_ids as select (random()*30000000)::bigint as val from generate_series(0,1000);

explain analyze select id from t where t.id in (select val from temp_ids);

 Nested Loop  (cost=17.51..11585.41 rows=1001 width=8) (actual time=7.062..15724.571 rows=1001 loops=1)
   ->  HashAggregate  (cost=17.51..27.52 rows=1001 width=8) (actual time=0.268..1.356 rows=1001 loops=1)
         ->  Seq Scan on temp_ids  (cost=0.00..15.01 rows=1001 width=8) (actual time=0.007..0.080 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.53 rows=1 width=8) (actual time=15.703..15.705 rows=1 loops=1001)
         Index Cond: (t.id = temp_ids.val)
 Total runtime: 15725.063 ms

-- another way: join against the temp table instead of IN
explain analyze select id from t join temp_ids on (t.id = temp_ids.val);

Nested Loop  (cost=0.00..24687.88 rows=2140 width=8) (actual time=22.594..16557.789 rows=1001 loops=1)
   ->  Seq Scan on temp_ids  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.014..0.872 rows=1001 loops=1)
   ->  Index Scan using t_pkey on t  (cost=0.00..11.51 rows=1 width=8) (actual time=16.536..16.537 rows=1 loops=1001)
         Index Cond: (t.id = temp_ids.val)
 Total runtime: 16558.331 ms

The temp table queries ran much faster when run again, but that's because the id value set is constant, so the target data is fresh in cache and Pg does no real IO the second time.
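
If you go the temp table route from an application, it's probably also worth analyzing the table after populating it, so the planner sees real row counts rather than default estimates (note the rows=2140 guess in the join plan above). A small sketch, assuming the id set is loaded by the application:

create temporary table temp_ids (val bigint primary key);
-- ... COPY or INSERT the id set from the application here ...
analyze temp_ids;
explain analyze select id from t join temp_ids on (t.id = temp_ids.val);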

answered by dbenhur


My somewhat naive tests show that using IN (...) is at least an order of magnitude faster than both a join on a temp table and a join on a common table expression. (Frankly, that surprised me.) I tested 3,000 integer values against a table of 300,000 rows.

create table integers (
  n integer primary key
);
insert into integers
select generate_series(0, 300000);

-- External ruby program generates 3000 random integers in the range of 0 to 299999.
-- Used Emacs to massage the output into a SQL statement that looks like

explain analyze
select integers.n 
from integers where n in (
100109,
100354 ,
100524 ,
...
);
answered by Mike Sherrill 'Cat Recall'