Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL: How to figure out missing numbers in a column using generate_series()?

SELECT commandid 
FROM results 
WHERE NOT EXISTS (
    SELECT * 
    FROM generate_series(0,119999) 
    WHERE generate_series = results.commandid 
    );

I have a column in results of type int but various tests failed and were not added to the table. I would like to create a query that returns a list of commandid that are not found in results. I thought the above query would do what I wanted. However, it does not even work if I use a range that is outside the expected possible range of commandid (like negative numbers).

like image 602
Manuel Zubieta Avatar asked Sep 16 '12 03:09

Manuel Zubieta


2 Answers

Given sample data:

create table results ( commandid integer primary key);
insert into results (commandid) select * from generate_series(1,1000);
delete from results where random() < 0.20;

This works:

SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE NOT EXISTS (SELECT 1 FROM results WHERE commandid = s.i);

as does this alternative formulation:

SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
LEFT OUTER JOIN results ON (results.commandid = s.i) 
WHERE results.commandid IS NULL;

Both of the above appear to result in identical query plans in my tests, but you should compare with your data on your database using EXPLAIN ANALYZE to see which is best.

Explanation

Note that instead of NOT IN I've used NOT EXISTS with a subquery in one formulation, and an ordinary OUTER JOIN in the other. It's much easier for the DB server to optimise these and it avoids the confusing issues that can arise with NULLs in NOT IN.

I initially favoured the OUTER JOIN formulation, but at least in 9.1 with my test data the NOT EXISTS form optimizes to the same plan.

Both will perform better than the NOT IN formulation below when the series is large, as in your case. NOT IN used to require Pg to do a linear search of the IN list for every tuple being tested, but examination of the query plan suggests Pg may be smart enough to hash it now. The NOT EXISTS (transformed into a JOIN by the query planner) and the JOIN work better.

The NOT IN formulation is both confusing in the presence of NULL commandids and can be inefficient:

SELECT s.i AS missing_cmd
FROM generate_series(0,1000) s(i)
WHERE s.i NOT IN (SELECT commandid FROM results);

so I'd avoid it. With 1,000,000 rows the other two completed in 1.2 seconds and the NOT IN formulation ran CPU-bound until I got bored and cancelled it.

like image 91
Craig Ringer Avatar answered Oct 08 '22 18:10

Craig Ringer


As I mentioned in the comment, you need to do the reverse of the above query.

SELECT
    generate_series
FROM
    generate_series(0, 119999)
WHERE
    NOT generate_series IN (SELECT commandid FROM results);

At that point, you should find values that do not exist within the commandid column within the selected range.

like image 34
pickypg Avatar answered Oct 08 '22 18:10

pickypg