Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving on GROUP BY in SQL

Tags:

sql

postgresql

Scenario

We have many people, these people go on trips with multiple stages/states (initially planning, then started, then returned_safely or ended in disaster).

I have a query which gets the right results, you can see it and play with it here:

http://sqlfiddle.com/#!15/2e096/1

However, I'm wondering if there's a better implementation, specifically one avoiding the use of GROUP BY and postgres' bool_and, potentially also avoiding a nested query.

What we want to know

Who has never experienced a trip they didn't return safely from?

Or, put another way:

Who has either: 1. Never planned or gone on a trip OR 2. only ever returned safely

Clarifications

  • If there's a record for a person in the trips table, but no stages, they're planning a trip.

Output

Should be at least all the columns from the person table, if other columns come out too, that's fine.

Setup/Reproduction

CREATE TABLE people (person_name text, gender text, age integer);
INSERT INTO people (person_name, gender, age)
  VALUES ('pete', 'm', 10), ('alan', 'm', 22), ('jess', 'f', 24), ('agnes', 'f', 25), ('matt', 'm', 26);

CREATE TABLE trips (person_name text, trip_name text);
INSERT INTO trips (person_name, trip_name)
  VALUES ('pete', 'a'),
         ('pete', 'b'),
         ('alan', 'c'),
         ('alan', 'd'),
         ('jess', 'e'),
         ('matt', 'f');

CREATE TABLE trip_stages (trip_name text, stage text, most_recent boolean);
INSERT INTO trip_stages
  VALUES ('a', 'started', 'f'), ('a', 'disaster', 't'),
         ('b', 'started', 't'),
         ('c', 'started', 'f'), ('c', 'safe_return', 't'),
         ('e', 'started', 'f'), ('e', 'safe_return', 't');

Summary of the situation

  • Pete has one trip which ended in disaster and one he just started
  • Alan has one trip he returned from safely and one he is planning
  • Jess has been on one trip, which she returned safely from
  • Agnes has never even planned a trip
  • Matt has planned a trip, but not started it yet

Solution

 person_name | gender | age
-------------+--------+-----
 jess        | f      | 24
 agnes       | f      | 25
  • Jess (has been on one trip, which she returned safely from)
  • Agnes (has never even planned a trip)

Working Query

SELECT people.* FROM people WHERE people.person_name IN (
  SELECT people.person_name FROM people
  LEFT OUTER JOIN trips
    ON trips.person_name = people.person_name
  LEFT OUTER JOIN trip_stages
    ON trip_stages.trip_name = trips.trip_name AND trip_stages.most_recent = 't'
  GROUP BY people.person_name
    HAVING bool_and(trips.trip_name IS NULL)
      OR bool_and(trip_stages.stage IS NOT NULL AND trip_stages.stage = 'safe_return')
)

Explanation

SELECT people.* FROM people WHERE people.person_name IN (
  -- All the people
  SELECT people.person_name FROM people

  -- + All their trips
  LEFT OUTER JOIN trips
    ON trips.person_name = people.person_name

  -- + All those trips' stages
  LEFT OUTER JOIN trip_stages
    ON trip_stages.trip_name = trips.trip_name AND trip_stages.most_recent = 't'

  -- Group by person
  GROUP BY people.person_name
    -- Filter to those rows where either:
    --   1. trip_name is always NULL (they've made no trips)
    --   2. Every trip has been ended with a safe return
    HAVING bool_and(trips.trip_name IS NULL)
      OR bool_and(trip_stages.stage IS NOT NULL AND trip_stages.stage = 'safe_return')
)

Question

Is there another way I could write this query? Without using GROUP BY and bool_and and ideally without using subqueries too? Perhaps some partition/window function?

I'm using this to learn, so explanations/analysis of queries appreciated!

I'm particularly interested in performance implications. e.g. What happens if people take thousands of trips? Do subqueries get out-performed by some other approach?

like image 752
Pete Hamilton Avatar asked May 01 '15 20:05

Pete Hamilton


2 Answers

SELECT p0.person_name FROM people p0
WHERE p0.person_name NOT IN (
 SELECT p.person_name FROM people p
 INNER JOIN trips t on p.person_name = t.person_name
 LEFT JOIN trip_stages s on t.trip_name = s.trip_name AND s.most_recent
 WHERE s.stage IS NULL OR s.stage != 'safe_return' );

FIDDLE

Much easier to retrieve who doesn't fit and use a NOT IN.

EDIT: With the understanding that I am unable to be as concise in prose as I am in code, an expanded explanation per IMSoP's suggestion:

SELECT p0.person_name FROM people p0
-- The outer query exists to reverse the results of the inner query. The inner query
-- returns person names which have not arrived safely, the outer query returns the names,
-- via the NOT IN operator, which don't result from the inner query.
WHERE p0.person_name NOT IN (
 SELECT p.person_name FROM people p
-- Selecting from the same table via a different alias (p vs p0) is useful for avoiding
-- ambiguity.
 INNER JOIN trips t on p.person_name = t.person_name
-- The INNER JOIN returns results only where a value in people.person_name matches the
-- trips.person_name. This has the effect of removing any person_names from the inner
-- query who haven't taken any trips.
 LEFT JOIN trip_stages s on t.trip_name = s.trip_name AND s.most_recent
-- The LEFT JOIN links any rows created from the previous INNER JOIN to the trip_stages
-- table where trips. The terms of the LEFT JOIN restrict the matches the rows where the
-- most_recent column is true. Unlike the INNER JOIN, the LEFT JOIN does not eliminate
-- rows where there is no match. Where there is no match, the columns from the left side
-- of the join are still populated, those from the right side of the join are NULL.
 WHERE s.stage IS NULL OR s.stage != 'safe_return'
-- s.stage IS NULL indicates that, via the LEFT JOIN above, a trip was planned but not
-- begun. As we are specifying that the trip stage we are looking at is the last one
-- recorded, any value other than safe_return indicates that the row we are looking at
-- does not meet the conditions set by OP, and is thus to be included for elimination by
-- the outer query.
);
like image 99
Politank-Z Avatar answered Oct 05 '22 23:10

Politank-Z


SELECT distinct trips.person_name 
  FROM trips 
RIGHT JOIN trip_stages 
  ON trips.trip_name = trip_stages.trip_name 
WHERE trip_stages.most_recent = 't' 
  GROUP BY trips.person_name, trip_stages.stage 
  HAVING trip_stages.stage is not null 
  AND trip_stages.stage = 'safe_return'

Sorry for did multiple edits but got it working to get only the names of people here

like image 28
Kesavacharan Avatar answered Oct 06 '22 01:10

Kesavacharan