Use something like TOP with GROUP BY

Given a table table1 like this:

+--------+-------+-------+------------+-------+
| flight |  orig |  dest |  passenger |  bags |
+--------+-------+-------+------------+-------+
|   1111 |  sfo  |  chi  |  david     |     3 |
|   1112 |  sfo  |  dal  |  david     |     7 |
|   1112 |  sfo  |  dal  |  kim       |    10 |
|   1113 |  lax  |  san  |  ameera    |     5 |
|   1114 |  lax  |  lfr  |  tim       |     6 |
|   1114 |  lax  |  lfr  |  jake      |     8 |
+--------+-------+-------+------------+-------+
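To reproduce the example, the table can be created with a script along these lines. This is only a sketch: the column types are assumptions, since the question does not state them.

```sql
-- Assumed column types; the question only shows the data, not the DDL.
CREATE TABLE table1 (
   flight    int
 , orig      text
 , dest      text
 , passenger text
 , bags      int
);

-- Sample rows copied from the table above.
INSERT INTO table1 (flight, orig, dest, passenger, bags) VALUES
   (1111, 'sfo', 'chi', 'david' ,  3)
 , (1112, 'sfo', 'dal', 'david' ,  7)
 , (1112, 'sfo', 'dal', 'kim'   , 10)
 , (1113, 'lax', 'san', 'ameera',  5)
 , (1114, 'lax', 'lfr', 'tim'   ,  6)
 , (1114, 'lax', 'lfr', 'jake'  ,  8);
```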

I'm aggregating the table by orig like this:

select 
  orig
  , count(*) flight_cnt
  , count(distinct passenger) as pass_cnt
  , percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig

I need to add the passenger with the longest name (by length(passenger)) for each orig group. How do I go about it?

Output expected

+------+-------------+-----------+---------------+-------------------+
| orig |  flight_cnt |  pass_cnt |   bag_cnt_med | pass_max_len_name |
+------+-------------+-----------+---------------+-------------------+
| sfo  |           3 |         2 |             7 |  david            |
| lax  |           3 |         3 |             6 | ameera            |
+------+-------------+-----------+---------------+-------------------+

asked May 12 '17 09:05

user3206440

1 Answer

You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON.

See: Select first row in each GROUP BY group?

But I see no way to combine that (or any other simple way) with your original query in a single SELECT. I suggest joining two separate subqueries:

SELECT *
FROM  (  -- your original query
   SELECT orig
        , count(*) AS flight_cnt
        , count(distinct passenger) AS pass_cnt
        , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
   FROM   table1
   GROUP  BY orig
   ) org_query
JOIN  (  -- my addition
   SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
   FROM   table1
   ORDER  BY orig, length(passenger) DESC NULLS LAST
   ) pas USING (orig);

USING in the join clause conveniently only outputs one instance of orig, so you can simply use SELECT * in the outer SELECT.

If passenger can be NULL, it is important to add NULLS LAST, because NULL values sort first in descending order by default:

PostgreSQL sort by datetime asc, null first?

If multiple passenger names in the same group share the same maximum length, you get an arbitrary pick unless you add more expressions to ORDER BY as tiebreakers. Detailed explanation in the answer linked above.
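A deterministic pick could look like the following sketch. Using the passenger name itself as tiebreaker is an assumption; any expression that orders ties unambiguously works.

```sql
-- Same DISTINCT ON subquery as above, with a tiebreaker added.
-- Among names of equal maximum length, the alphabetically first one wins
-- (choice of tiebreaker is an assumption, pick whatever fits your case).
SELECT DISTINCT ON (orig)
       orig, passenger AS pass_max_len_name
FROM   table1
ORDER  BY orig
        , length(passenger) DESC NULLS LAST
        , passenger;
```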

Performance?

Typically, a single scan is superior, especially with sequential scans.

The above query uses two scans (possibly index or index-only scans). But the second scan is comparatively cheap unless the table is too big to fit (mostly) in cache. Lukas suggested an alternative query with only a single SELECT, adding:

, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1]  -- I'd add NULLS LAST

The idea is smart, but last time I tested, array_agg with ORDER BY did not perform so well. (The overhead of per-group ORDER BY is substantial, and array handling is expensive, too.)
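For reference, the complete single-SELECT variant would presumably look like this (a sketch combining the original query with the array_agg expression, including the suggested NULLS LAST):

```sql
SELECT orig
     , count(*) AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
       -- sort names per group by length, then keep only the first array element
     , (array_agg(passenger ORDER BY length(passenger) DESC NULLS LAST))[1] AS pass_max_len_name
FROM   table1
GROUP  BY orig;
```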

The same approach can be cheaper with a custom aggregate function first(), as described in the Postgres Wiki. Or, faster yet, with a version written in C, available on PGXN. Either eliminates the extra cost of array handling, but we still need the per-group ORDER BY. It may be faster for only a few groups. You would then add:

 , first(passenger ORDER BY length(passenger) DESC NULLS LAST)
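A minimal sketch of such an aggregate, modeled on the SQL version described in the Postgres Wiki (the helper name first_agg is illustrative, and this is not the C version from PGXN):

```sql
-- State transition function: always keep the state seen so far.
-- Being STRICT with no initial condition, the first non-null input becomes the state.
CREATE FUNCTION first_agg(anyelement, anyelement)
  RETURNS anyelement
  LANGUAGE sql IMMUTABLE STRICT AS
'SELECT $1';

CREATE AGGREGATE first(anyelement) (
  SFUNC = first_agg
, STYPE = anyelement
);

-- Usage in the original query:
SELECT orig
     , count(*) AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
     , first(passenger ORDER BY length(passenger) DESC NULLS LAST) AS pass_max_len_name
FROM   table1
GROUP  BY orig;
```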

Gordon and Lukas also mention the window function first_value(). Window functions are applied after aggregate functions. To use it in the same SELECT, we would need to aggregate passenger somehow first: a catch-22. Gordon solves this with a subquery, which is another candidate for good performance with standard Postgres.
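The subquery approach might look roughly like this. It is a sketch of the general idea, not necessarily Gordon's exact query; the alias longest_name is illustrative.

```sql
SELECT orig
     , count(*) AS flight_cnt
     , count(DISTINCT passenger) AS pass_cnt
     , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
       -- every row of a group carries the same value, so max() just picks it
     , max(longest_name) AS pass_max_len_name
FROM  (
   SELECT t.*
        , first_value(passenger) OVER (PARTITION BY orig
                                       ORDER BY length(passenger) DESC NULLS LAST) AS longest_name
   FROM   table1 t
   ) sub
GROUP  BY orig;
```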

first() does the same without subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig values. Details:

Optimize GROUP BY query to retrieve latest record per user
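As an illustration of the last point, assuming a hypothetical table origins with one row per unique orig value (not part of the question), a LATERAL subquery can fetch the longest name per group:

```sql
-- "origins" is a hypothetical lookup table holding each orig exactly once.
SELECT o.orig, p.passenger AS pass_max_len_name
FROM   origins o
CROSS  JOIN LATERAL (
   SELECT t.passenger
   FROM   table1 t
   WHERE  t.orig = o.orig
   ORDER  BY length(t.passenger) DESC NULLS LAST
   LIMIT  1
   ) p;

-- An expression index matching the sort order may help here (an assumption to verify by testing):
CREATE INDEX table1_orig_len_idx
    ON table1 (orig, length(passenger) DESC NULLS LAST);
```

Note that CROSS JOIN LATERAL drops origins without any rows in table1; use LEFT JOIN LATERAL ... ON true to keep them.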

The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.
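One way to compare candidates on your own data is EXPLAIN with the ANALYZE and BUFFERS options, which executes the query and reports actual timings and buffer usage. For example, for the join query above:

```sql
-- ANALYZE actually runs the query, so repeat a few times to account for caching.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM  (
   SELECT orig
        , count(*) AS flight_cnt
        , count(DISTINCT passenger) AS pass_cnt
        , percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
   FROM   table1
   GROUP  BY orig
   ) org_query
JOIN  (
   SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
   FROM   table1
   ORDER  BY orig, length(passenger) DESC NULLS LAST
   ) pas USING (orig);
```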

answered Nov 15 '22 21:11

Erwin Brandstetter

Tags:

sql

postgresql

greatest-n-per-group

aggregate
