With table table1
like below
+--------+-------+-------+------------+-------+
| flight | orig | dest | passenger | bags |
+--------+-------+-------+------------+-------+
| 1111 | sfo | chi | david | 3 |
| 1112 | sfo | dal | david | 7 |
| 1112 | sfo | dal | kim | 10|
| 1113 | lax | san | ameera | 5 |
| 1114 | lax | lfr | tim | 6 |
| 1114 | lax | lfr | jake | 8 |
+--------+-------+-------+------------+-------+
I'm aggregating the table by orig
like below
select
orig
, count(*) flight_cnt
, count(distinct passenger) as pass_cnt
, percentile_cont(0.5) within group ( order by bags ASC) as bag_cnt_med
from table1
group by orig
I need to add the passenger
with the longest name ( length(passenger)
) for each orig
group - how do I go about it?
Output expected
+------+-------------+-----------+---------------+-------------------+
| orig | flight_cnt | pass_cnt | bags_cnt_med | pass_max_len_name |
+------+-------------+-----------+---------------+-------------------+
| sfo | 3 | 2 | 7 | david |
| lax | 3 | 3 | 6 | ameera |
+------+-------------+-----------+---------------+-------------------+
The GROUP BY statement is often used with aggregate functions ( COUNT() , MAX() , MIN() , SUM() , AVG() ) to group the result-set by one or more columns.
Using Group By and Order By Together When combining the Group By and Order By clauses, it is important to bear in mind that, in terms of placement within a SELECT statement: The GROUP BY clause is placed after the WHERE clause. The GROUP BY clause is placed before the ORDER BY clause.
GROUP BY Clause is utilized with the SELECT statement. GROUP BY aggregates the results on the basis of selected column: COUNT, MAX, MIN, SUM, AVG, etc. GROUP BY returns only one result per group of data. GROUP BY Clause always follows the WHERE Clause.
SQL Sub-query as a GROUP BY and HAVING Alternative You can use a sub-query to remove the GROUP BY from the query which is using SUM aggregate function. There are many types of subqueries in Hive, but, you can use correlated subquery to calculate sum part.
You can conveniently retrieve the passenger with the longest name per group with DISTINCT ON
.
But I see no way to combine that (or any other simple way) with your original query in a single SELECT
. I suggest to join two separate subqueries:
SELECT *
FROM ( -- your original query
SELECT orig
, count(*) AS flight_cnt
, count(distinct passenger) AS pass_cnt
, percentile_cont(0.5) WITHIN GROUP (ORDER BY bags) AS bag_cnt_med
FROM table1
GROUP BY orig
) org_query
JOIN ( -- my addition
SELECT DISTINCT ON (orig) orig, passenger AS pass_max_len_name
FROM table1
ORDER BY orig, length(passenger) DESC NULLS LAST
) pas USING (orig);
USING
in the join clause conveniently only outputs one instance of orig
, so you can simply use SELECT *
in the outer SELECT
.
If passenger
can be NULL, it is important to add NULLS LAST
:
From multiple passenger names with the same maximum length in the same group, you get an arbitrary pick - unless you add more expressions to ORDER BY
as tiebreaker. Detailed explanation in the answer linked above.
Typically, a single scan is superior, especially with sequential scans.
The above query uses two scans (maybe index / index-only scans). But the second scan is comparatively cheap unless the table is too huge to fit in cache (mostly). Lukas suggested an alternative query with only a single SELECT
adding:
, (ARRAY_AGG (passenger ORDER BY LENGTH (passenger) DESC))[1] -- I'd add NULLS LAST
The idea is smart, but last time I tested, array_agg
with ORDER BY
did not perform so well. (The overhead of per-group ORDER BY
is substantial, and array handling is expensive, too.)
The same approach can be cheaper with a custom aggregate function first()
like instructed in the Postgres Wiki here. Or, faster, yet, with a version written in C, available on PGXN. Eliminates the extra cost for array handling, but we still need per-group ORDER BY
. May be faster for only few groups. You would then add:
, first(passenger ORDER BY length(passenger) DESC NULLS LAST)
Gordon and Lukas also mention the window function first_value()
. Window functions are applied after aggregate functions. To use it in the same SELECT
, we would need to aggregate passenger
somehow first - catch 22. Gordon solves this with a subquery - another candidate for good performance with standard Postgres.
first()
does the same without subquery and should be simpler and a bit faster. But it still won't be faster than a separate DISTINCT ON
for most cases with few rows per group. For lots of rows per group, a recursive CTE technique is typically faster. There are yet faster techniques if you have a separate table holding all relevant, unique orig
values. Details:
The best solution depends on various factors. The proof of the pudding is in the eating. To optimize performance you have to test with your setup. The above query should be among the fastest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With