
Best way to select random rows PostgreSQL

Tags:

performance

sql

random

postgresql

I want a random selection of rows in PostgreSQL. I tried this:

    select * from table where random() < 0.01;

But some others recommend this:

    select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want it to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?

977

asked Dec 29 '11 23:12

nanounanue

1 Answer

Given your specifications (plus additional info in the comments),

- You have a numeric ID column (integer numbers) with only few (or moderately few) gaps.
- Obviously no or few write operations.
- Your ID column has to be indexed! A primary key serves nicely.

The query below does not need a sequential scan of the big table, only an index scan.

First, get estimates for the main query:

    SELECT count(*) AS ct              -- optional
         , min(id)  AS min_id
         , max(id)  AS max_id
         , max(id) - min(id) AS id_span
    FROM   big;

The only possibly expensive part is the count(*) (for huge tables). Given the above specifications, you don't need it. An estimate will do just fine and is available at almost no cost (detailed explanation here):

    SELECT reltuples AS ct
    FROM   pg_class
    WHERE  oid = 'schema_name.big'::regclass;
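To compare the estimate with the ID span in one go, a minimal sketch along these lines may help. It is not part of the original answer: the table `big_demo`, its `payload` column and the `density` output column are made up for illustration. Keep in mind that `reltuples` is only as fresh as the last `ANALYZE` or autovacuum run.

```sql
-- Demo setup (hypothetical): a small stand-in for "big" with about 10 % gaps.
CREATE TEMP TABLE big_demo AS
SELECT g AS id, md5(g::text) AS payload
FROM   generate_series(1, 100000) g
WHERE  g % 10 <> 0;
ALTER TABLE big_demo ADD PRIMARY KEY (id);
ANALYZE big_demo;              -- refreshes pg_class.reltuples

-- How densely is the id space populated? A density near 1 means few gaps.
SELECT c.reltuples::bigint                          AS ct_estimate
     , s.id_span
     , round((c.reltuples / s.id_span)::numeric, 3) AS density
FROM   pg_class c
CROSS  JOIN (
   SELECT max(id) - min(id) + 1 AS id_span  -- min / max can be read from the index
   FROM   big_demo
   ) s
WHERE  c.oid = 'big_demo'::regclass;
```

For the demo table the density should come out near 0.9. On a real table, substitute your own table name in both places.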

As long as ct isn't much smaller than id_span, the query will outperform other approaches.

    WITH params AS (
       SELECT 1       AS min_id           -- minimum id <= current min id
            , 5100000 AS id_span          -- rounded up. (max_id - min_id + buffer)
       )
    SELECT *
    FROM  (
       SELECT p.min_id + trunc(random() * p.id_span)::integer AS id
       FROM   params p
            , generate_series(1, 1100) g  -- 1000 + buffer
       GROUP  BY 1                        -- trim duplicates
       ) r
    JOIN   big USING (id)
    LIMIT  1000;                          -- trim surplus

- Generate random numbers in the id space. You have "few gaps", so add 10 % (enough to easily cover the blanks) to the number of rows to retrieve.
- Each id can be picked multiple times by chance (though very unlikely with a big id space), so group the generated numbers (or use DISTINCT).
- Join the ids to the big table. This should be very fast with the index in place.
- Finally trim surplus ids that have not been eaten by dupes and gaps. Every row has a completely equal chance to be picked.
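If you would rather derive the buffer than guess it, the steps above can be sketched as follows. This is an illustration, not the answer's query: `big_demo`, `payload`, the `sized` CTE and the 5 % safety margin are assumptions. The number of candidates is the wanted row count divided by the hit rate (rows per id in the span), plus the margin.

```sql
-- Demo setup (hypothetical): stand-in for "big" with about 10 % gaps.
CREATE TEMP TABLE big_demo AS
SELECT g AS id, md5(g::text) AS payload
FROM   generate_series(1, 100000) g
WHERE  g % 10 <> 0;
ALTER TABLE big_demo ADD PRIMARY KEY (id);

WITH params AS (
   SELECT min(id)               AS min_id
        , max(id) - min(id) + 1 AS id_span
        , count(*)              AS ct       -- exact count: affordable on the demo table only
   FROM   big_demo
   )
, sized AS (
   SELECT min_id
        , id_span
          -- 1000 wanted rows / hit rate, plus a 5 % safety margin (assumed value)
        , ceil(1000 * id_span::numeric / ct * 1.05)::int AS candidates
   FROM   params
   )
SELECT count(*) AS rows_found               -- should normally reach 1000
FROM  (
   SELECT b.*
   FROM  (
      SELECT DISTINCT s.min_id + trunc(random() * s.id_span)::int AS id
      FROM   sized s
           , generate_series(1, s.candidates) g   -- implicit lateral reference to "sized"
      ) r
   JOIN   big_demo b USING (id)                   -- drops ids that fall into gaps
   LIMIT  1000                                    -- trim surplus
   ) sample;
```

On the real table you would replace `count(*)` with the `reltuples` estimate, since the exact count is the expensive part.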

Short version

You can simplify this query. The CTE in the query above is just for educational purposes:

    SELECT *
    FROM  (
       SELECT DISTINCT 1 + trunc(random() * 5100000)::integer AS id
       FROM   generate_series(1, 1100) g
       ) r
    JOIN   big USING (id)
    LIMIT  1000;

Refine with rCTE

Especially if you are not so sure about gaps and estimates.

    WITH RECURSIVE random_pick AS (
       SELECT *
       FROM  (
          SELECT 1 + trunc(random() * 5100000)::int AS id
          FROM   generate_series(1, 1030)  -- 1000 + few percent - adapt to your needs
          LIMIT  1030                      -- hint for query planner
          ) r
       JOIN   big b USING (id)             -- eliminate miss

       UNION                               -- eliminate dupe
       SELECT b.*
       FROM  (
          SELECT 1 + trunc(random() * 5100000)::int AS id
          FROM   random_pick r             -- plus 3 percent - adapt to your needs
          LIMIT  999                       -- less than 1000, hint for query planner
          ) r
       JOIN   big b USING (id)             -- eliminate miss
       )
    TABLE  random_pick
    LIMIT  1000;  -- actual limit

We can work with a smaller surplus in the base query. If there are too many gaps, so that we don't find enough rows in the first iteration, the rCTE continues to iterate with the recursive term. We still need relatively few gaps in the ID space, or the recursion may run dry before the limit is reached. Otherwise we have to start with a large enough buffer, which defeats the purpose of optimizing performance.

Duplicates are eliminated by the UNION in the rCTE.

The outer LIMIT makes the CTE stop as soon as we have enough rows.

This query is carefully drafted to use the available index, generate actually random rows and not stop until we fulfill the limit (unless the recursion runs dry). There are a number of pitfalls here if you are going to rewrite it.
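If you do rewrite it, one way to check that the index is still used is to look at the plan. The sketch below runs the rCTE against a made-up demo table (`big_demo` and `payload` are assumptions, and 1000000 stands in for the ID span). On a large enough table you would expect index scans on the primary key rather than a sequential scan, but the planner's choice depends on table size and settings, so treat the output as something to inspect, not a guarantee.

```sql
-- Demo setup (hypothetical): one million ids with about 10 % gaps.
CREATE TEMP TABLE big_demo AS
SELECT g AS id, md5(g::text) AS payload
FROM   generate_series(1, 1000000) g
WHERE  g % 10 <> 0;
ALTER TABLE big_demo ADD PRIMARY KEY (id);
ANALYZE big_demo;

-- EXPLAIN ANALYZE executes the query and reports the plan actually used.
EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE random_pick AS (
   SELECT *
   FROM  (
      SELECT 1 + trunc(random() * 1000000)::int AS id
      FROM   generate_series(1, 1030)
      LIMIT  1030
      ) r
   JOIN   big_demo b USING (id)      -- eliminate misses

   UNION                             -- eliminate dupes
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 1000000)::int AS id
      FROM   random_pick r
      LIMIT  999
      ) r
   JOIN   big_demo b USING (id)      -- eliminate misses
   )
TABLE  random_pick
LIMIT  1000;
```

A sequential scan on the big table in this output would suggest a missing index or a rewrite that the planner can no longer satisfy with index lookups.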

Wrap into function

For repeated use with varying parameters:

    CREATE OR REPLACE FUNCTION f_random_sample(_limit int = 1000, _gaps real = 1.03)
      RETURNS SETOF big
      LANGUAGE plpgsql VOLATILE ROWS 1000 AS
    $func$
    DECLARE
       _surplus  int := _limit * _gaps;
       _estimate int := (           -- get current estimate from system
          SELECT c.reltuples * _gaps
          FROM   pg_class c
          WHERE  c.oid = 'big'::regclass);
    BEGIN
       RETURN QUERY
       WITH RECURSIVE random_pick AS (
          SELECT *
          FROM  (
             SELECT 1 + trunc(random() * _estimate)::int
             FROM   generate_series(1, _surplus) g
             LIMIT  _surplus           -- hint for query planner
             ) r (id)
          JOIN   big USING (id)        -- eliminate misses

          UNION                        -- eliminate dupes
          SELECT *
          FROM  (
             SELECT 1 + trunc(random() * _estimate)::int
             FROM   random_pick        -- just to make it recursive
             LIMIT  _limit             -- hint for query planner
             ) r (id)
          JOIN   big USING (id)        -- eliminate misses
       )
       TABLE  random_pick
       LIMIT  _limit;
    END
    $func$;

Call:

    SELECT * FROM f_random_sample();
    SELECT * FROM f_random_sample(500, 1.05);

You could even make this generic to work for any table: Take the name of the PK column and the table as polymorphic type and use EXECUTE ... But that's beyond the scope of this question. See:

Refactor a PL/pgSQL function to return the output of various SELECT queries

Possible alternative

If your requirements allow identical sets for repeated calls (and we are talking about repeated calls), I would consider a materialized view. Execute the above query once and write the result to a table. Users get a quasi-random selection at lightning speed. Refresh your random pick at intervals or events of your choosing.
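A minimal sketch of that idea, assuming a made-up demo table: `big_demo`, `payload` and the view name `mv_random_sample` are hypothetical. The demo uses a regular table because a materialized view cannot be built on a temporary one.

```sql
-- Demo setup (hypothetical): stand-in for "big" with about 10 % gaps.
CREATE TABLE big_demo AS
SELECT g AS id, md5(g::text) AS payload
FROM   generate_series(1, 100000) g
WHERE  g % 10 <> 0;
ALTER TABLE big_demo ADD PRIMARY KEY (id);

-- Draw the sample once and store it.
CREATE MATERIALIZED VIEW mv_random_sample AS
SELECT b.*
FROM  (
   SELECT DISTINCT 1 + trunc(random() * 100000)::int AS id
   FROM   generate_series(1, 1200) g           -- 1000 + buffer for gaps and dupes
   ) r
JOIN   big_demo b USING (id)
LIMIT  1000;

-- Repeated calls read the stored rows and see the same set.
SELECT * FROM mv_random_sample;

-- Draw a fresh sample at an interval or event of your choosing.
REFRESH MATERIALIZED VIEW mv_random_sample;
```

With the function from the previous section in place, the view definition could shrink to a plain `SELECT * FROM f_random_sample()`. Drop the demo objects with `DROP MATERIALIZED VIEW mv_random_sample; DROP TABLE big_demo;` when done.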

Postgres 9.5 introduces `TABLESAMPLE SYSTEM (n)`

Where n is a percentage. The manual:

"The BERNOULLI and SYSTEM sampling methods each accept a single argument which is the fraction of the table to sample, expressed as a percentage between 0 and 100. This argument can be any real-valued expression."

It's very fast, but the result is not exactly random. The manual again:

"The SYSTEM method is significantly faster than the BERNOULLI method when small sampling percentages are specified, but it may return a less-random sample of the table as a result of clustering effects."

The number of rows returned can vary wildly. For our example, to get roughly 1000 rows:

    SELECT * FROM big TABLESAMPLE SYSTEM ((1000 * 100) / 5100000.0);

Related: Fast way to discover the row count of a table in PostgreSQL

Or install the additional module tsm_system_rows to get the number of requested rows exactly (if there are enough) and allow for the more convenient syntax:

    SELECT * FROM big TABLESAMPLE SYSTEM_ROWS(1000);
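Installing the module is a one-time step per database. A small sketch, assuming the contrib packages are available on the server and you have the privilege to create extensions; `big_demo` and `payload` are made up for the demo:

```sql
-- One-time setup per database.
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

-- Demo table (hypothetical stand-in for "big").
CREATE TEMP TABLE big_demo AS
SELECT g AS id, md5(g::text) AS payload
FROM   generate_series(1, 100000) g;

-- Returns exactly 1000 rows here, because the table holds more than that.
SELECT count(*) AS sampled_rows
FROM   big_demo TABLESAMPLE SYSTEM_ROWS(1000);
```

The rows still come from randomly chosen blocks, so the clustering caveat quoted above applies to this method as well.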

See Evan's answer for details.

But that's still not exactly random.

127

answered Sep 18 '22 05:09

Erwin Brandstetter
