We have many people, these people go on trips with multiple stages/states (initially planning, then started, then returned_safely or ended in disaster).
I have a query which gets the right results, you can see it and play with it here:
http://sqlfiddle.com/#!15/2e096/1
However, I'm wondering if there's a better implementation, specifically one avoiding the use of GROUP BY
and postgres' bool_and
, potentially also avoiding a nested query.
Who has never experienced a trip they didn't return safely from?
Or, put another way:
Who has either:
1. Never planned or gone on a trip
OR2. only ever returned safely
Clarifications
Output
Should be at least all the columns from the person
table, if other columns come out too, that's fine.
CREATE TABLE people (person_name text, gender text, age integer);
INSERT INTO people (person_name, gender, age)
VALUES ('pete', 'm', 10), ('alan', 'm', 22), ('jess', 'f', 24), ('agnes', 'f', 25), ('matt', 'm', 26);
CREATE TABLE trips (person_name text, trip_name text);
INSERT INTO trips (person_name, trip_name)
VALUES ('pete', 'a'),
('pete', 'b'),
('alan', 'c'),
('alan', 'd'),
('jess', 'e'),
('matt', 'f');
CREATE TABLE trip_stages (trip_name text, stage text, most_recent boolean);
INSERT INTO trip_stages
VALUES ('a', 'started', 'f'), ('a', 'disaster', 't'),
('b', 'started', 't'),
('c', 'started', 'f'), ('c', 'safe_return', 't'),
('e', 'started', 'f'), ('e', 'safe_return', 't');
person_name | gender | age
-------------+--------+-----
jess | f | 24
agnes | f | 25
SELECT people.* FROM people WHERE people.person_name IN (
SELECT people.person_name FROM people
LEFT OUTER JOIN trips
ON trips.person_name = people.person_name
LEFT OUTER JOIN trip_stages
ON trip_stages.trip_name = trips.trip_name AND trip_stages.most_recent = 't'
GROUP BY people.person_name
HAVING bool_and(trips.trip_name IS NULL)
OR bool_and(trip_stages.stage IS NOT NULL AND trip_stages.stage = 'safe_return')
)
SELECT people.* FROM people WHERE people.person_name IN (
-- All the people
SELECT people.person_name FROM people
-- + All their trips
LEFT OUTER JOIN trips
ON trips.person_name = people.person_name
-- + All those trips' stages
LEFT OUTER JOIN trip_stages
ON trip_stages.trip_name = trips.trip_name AND trip_stages.most_recent = 't'
-- Group by person
GROUP BY people.person_name
-- Filter to those rows where either:
-- 1. trip_name is always NULL (they've made no trips)
-- 2. Every trip has been ended with a safe return
HAVING bool_and(trips.trip_name IS NULL)
OR bool_and(trip_stages.stage IS NOT NULL AND trip_stages.stage = 'safe_return')
)
Is there another way I could write this query? Without using GROUP BY
and bool_and
and ideally without using subqueries too? Perhaps some partition/window function?
I'm using this to learn, so explanations/analysis of queries appreciated!
I'm particularly interested in performance implications. e.g. What happens if people take thousands of trips? Do subqueries get out-performed by some other approach?
SELECT p0.person_name FROM people p0
WHERE p0.person_name NOT IN (
SELECT p.person_name FROM people p
INNER JOIN trips t on p.person_name = t.person_name
LEFT JOIN trip_stages s on t.trip_name = s.trip_name AND s.most_recent
WHERE s.stage IS NULL OR s.stage != 'safe_return' );
FIDDLE
Much easier to retrieve who doesn't fit and use a NOT IN
.
EDIT: With the understanding that I am unable to be as concise in prose as I am in code, an expanded explanation per IMSoP's suggestion:
SELECT p0.person_name FROM people p0
-- The outer query exists to reverse the results of the inner query. The inner query
-- returns person names which have not arrived safely, the outer query returns the names,
-- via the NOT IN operator, which don't result from the inner query.
WHERE p0.person_name NOT IN (
SELECT p.person_name FROM people p
-- Selecting from the same table via a different alias (p vs p0) is useful for avoiding
-- ambiguity.
INNER JOIN trips t on p.person_name = t.person_name
-- The INNER JOIN returns results only where a value in people.person_name matches the
-- trips.person_name. This has the effect of removing any person_names from the inner
-- query who haven't taken any trips.
LEFT JOIN trip_stages s on t.trip_name = s.trip_name AND s.most_recent
-- The LEFT JOIN links any rows created from the previous INNER JOIN to the trip_stages
-- table where trips. The terms of the LEFT JOIN restrict the matches the rows where the
-- most_recent column is true. Unlike the INNER JOIN, the LEFT JOIN does not eliminate
-- rows where there is no match. Where there is no match, the columns from the left side
-- of the join are still populated, those from the right side of the join are NULL.
WHERE s.stage IS NULL OR s.stage != 'safe_return'
-- s.stage IS NULL indicates that, via the LEFT JOIN above, a trip was planned but not
-- begun. As we are specifying that the trip stage we are looking at is the last one
-- recorded, any value other than safe_return indicates that the row we are looking at
-- does not meet the conditions set by OP, and is thus to be included for elimination by
-- the outer query.
);
SELECT distinct trips.person_name
FROM trips
RIGHT JOIN trip_stages
ON trips.trip_name = trip_stages.trip_name
WHERE trip_stages.most_recent = 't'
GROUP BY trips.person_name, trip_stages.stage
HAVING trip_stages.stage is not null
AND trip_stages.stage = 'safe_return'
Sorry for did multiple edits but got it working to get only the names of people here
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With