How to guarantee that at least N rows are returned by recursive CTE in Postgres

Most resources that describe a SELECT TOP ... query in Postgres say that you should use LIMIT instead, possibly with an ORDER BY clause if you need to select the top elements by some ordering.

What do you do if you need to select the top N elements from a recursive query where there is no ordering, and where the query can return fewer than N rows without the recursion? In that case the TOP part is necessary to ensure the result set has at least N rows, whereas LIMIT can allow fewer.

My specific use case is a modification of the dynamic SQL pattern for selecting a random subsample of a table.

Here is a link to the sql source of my modification. The easiest thing is to look at the final function defined there, _random_select. It follows the approach from the above link very closely, but has been modified to be polymorphic in the input table and output result set, and to correctly return only the columns that already existed in the input table (yet another dynamic SQL hack, to exclude the interim row_number result from the final result set).

It's an eyesore, but it's the closest thing I have to a reproducible example. If you use _random_select and attempt to get something around 4500 rows from a table larger than 4500 rows, you begin to see smaller result sets with high probability, and it only gets worse as you increase the size of your desired sample (because the occurrence of duplicates gets worse as your desired sample gets larger).

Note that in my modification, I am not using the _gaps trick from this link, meant to over-sample to offset sampling inefficiency if there are gaps in a certain index column. That part doesn't relate to this question, and in my case I am using row_number to ensure that there is an integer column with no possible gaps.

The CTE is recursive to ensure that if the first, non-recursive part of the CTE doesn't give you enough rows (because of duplicates removed by the UNION), it will go through another round of the recursive term and keep tacking on more results until you've got enough.

In the linked example, LIMIT is used, but I am finding that this does not work. The method returns fewer results because LIMIT is only an "at most N rows" guarantee.

How do you get an "at least N rows" guarantee? Selecting the TOP N rows seemed like the natural way to do this (so that the recursive CTE would have to keep chugging along until it gets enough rows to satisfy the TOP condition), but this is not available in Postgres.
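For reference, here is a minimal sketch of the recursive sampling pattern being discussed, written against a hypothetical table big_table whose id column runs from 1 to 5000 with no gaps (the table name, sizes, and batch counts are illustrative placeholders, not the exact code from the linked source):

WITH RECURSIVE random_pick AS (
   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5000)::int AS id
      FROM   generate_series(1, 1000)   -- initial batch of random ids
      GROUP  BY 1                       -- fold duplicate ids within the batch
      ) r
   JOIN   big_table b USING (id)        -- drop ids that match no row

   UNION                                -- drop rows already in the result

   SELECT b.*
   FROM  (
      SELECT 1 + trunc(random() * 5000)::int AS id
      FROM   random_pick                -- one new random id per working-table row
      ) r
   JOIN   big_table b USING (id)
   )
SELECT *
FROM   random_pick
LIMIT  3000;                            -- only an "at most 3000" cap, which is the problem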

asked Mar 20 '16 by ely

2 Answers

This is too long for a comment, but is informative about what's going on with my existing query. From the documentation on recursive query evaluation, the steps that a recursive query will take are:

1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.

2. So long as the working table is not empty, repeat these steps:

   a. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.

   b. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.

So my hunch in the comments (after trying with UNION ALL) was mostly on the right track.

As the documentation states, this is actually just a form of iteration that substitutes the result of the previous round (the working table) for the recursive self-reference in the recursive term.

So it's more like an ever-shrinking process, where the "working table" substituted for the recursive self-reference consists only of those rows from the most recent round that were not duplicates of any previous results.
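A tiny, self-contained illustration of that termination behavior (nothing to do with sampling; it just shows how UNION's duplicate removal can empty the working table and end the iteration):

WITH RECURSIVE t(n) AS (
   SELECT 1
   UNION                    -- discards rows that duplicate any previous result
   SELECT (n % 3) + 1       -- cycles 1 -> 2 -> 3 -> 1 ...
   FROM   t
)
SELECT * FROM t;
-- returns 1, 2, 3: the fourth value would be 1 again, a duplicate, so the
-- working table becomes empty and the iteration stops (with UNION ALL the
-- same query would loop forever)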

Here's an example. Say we have 5000 rows in the table and we want to sample 3000 unique rows, doing a recursive sample of 1000 (potentially not unique) samples at a time.

We do the first batch of 1000 and remove duplicates, ending up with about 818 non-dupes based on the binomial distribution for these large numbers (N = 5000, m = 1000, k = 1; rearrange terms to avoid overflow).

These 818 become the working table and this result set is subbed in as the recursive term for our next pass. We draw another set of about 818 unique rows, but then have to remove duplicates (UNION) when comparing to the original 818 that are in the working table. Two different random draws of 818 will have significant overlap (somewhere around 150 on average), so all of those are discarded and whatever new unique rows remain become the new working table.

So you'll add about 818 unique samples on the first draw, then the working table shrinks and you'll add about 650 on the second draw, the working table shrinks again, ... and you keep doing this until either you reach the overall total of samples desired (3000 in this case) or the working table ends up empty.

Once the working table is small enough, there will be a very high probability that everything within it will be duplicated in the next draw of 1000, at which point the working table becomes empty and the recursion terminates.

For drawing 3000 samples, you might be able to repeat this working-table update enough times. But as you move from 3000 closer to the table's total size of 5000, the probability of the working table surviving long enough shrinks toward zero very fast.
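For what it's worth, those ballpark figures can be sanity-checked with a bit of arithmetic in Postgres. The ~818 above matches the expected number of ids drawn exactly once in a batch of 1000; the expected number of distinct ids in that batch (what the UNION actually keeps) is a little higher, around 906, but the shrinking dynamic is the same either way:

SELECT round(1000 * power(1 - 1.0/5000, 999))         AS ids_drawn_exactly_once,  -- ~ 819
       round(5000 * (1 - power(1 - 1.0/5000, 1000)))  AS distinct_ids_in_batch;   -- ~ 906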

So instead of this being an optimizer issue that short-circuits with a smaller result set, it's actually an issue with the specific way "recursion" is implemented in Postgres -- it's not really recursion but simple iteration, where each round operates only on the rows from the latest recursive-term output that are not already in the result. For random sampling like this, that working table gets smaller very fast with each iteration, until it's extremely likely to be empty due to the high probability of selecting a duplicate.

answered by ely


Your assessment is to the point. The recursive query in my referenced answer is only somewhat more flexible than the original simple query. It still requires relatively few gaps in the ID space and a sample size that is substantially smaller than the table size to be reliable.

While we need a comfortable surplus ("limit + buffer") in the simple query to cover the worst case scenario of misses and duplicates, we can work with a smaller surplus that's typically enough - since we have the safety net of a recursive query that will fill in if we should fall short of the limit in the first pass.
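For readers who haven't seen it, the "limit + buffer" idea in the simple query looks roughly like this (again using the illustrative big_table with 5000 gap-free ids; the 10% buffer is an arbitrary example): over-sample the random ids so that, after duplicates (and, when there are gaps, misses) are folded, at least 1000 rows usually remain.

SELECT b.*
FROM  (
   SELECT 1 + trunc(random() * 5000)::int AS id
   FROM   generate_series(1, 1100)   -- 1000 requested + buffer
   GROUP  BY 1                       -- fold duplicate ids
   ) r
JOIN   big_table b USING (id)        -- with gaps, this also drops misses
LIMIT  1000;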

Either way, the technique is intended to get a small random selection from a big table quickly.

The technique is pointless for cases with too many gaps or (your focus) a sample size that is too close to the total table size - so that the recursive term might run dry before the limit is reached. For such cases a plain old:

SELECT *   -- or DISTINCT * to fold duplicates like UNION does
FROM   big_table
ORDER  BY random()
LIMIT  n;

... is more efficient: you are going to read most of the table anyway.

answered by Erwin Brandstetter