I'm try to get the mode()
for a grouped data set, but without grouping the results. (Using Postgres 9.5, can upgrade if needed.)
e.g. Users have a 'favorite color', and belong to a single group. Get list of users with the mode()
'favorite color' within their group.
A window function would work for most aggregates, but mode()
seems to be an exception that isn't compatible with window functions. Is there another way to go about this? Here's what I've been toying with so far ...
Works but gives grouped results, I'm looking for the results to be ungrouped:
SELECT group_id,
mode() WITHIN GROUP (ORDER BY color)
FROM users
GROUP BY group_id;
Invalid syntax (just an example of what I'm trying to accomplish):
SELECT id, color, group_id,
mode(color) OVER (PARTITION BY group_id)
FROM users;
Or:
SELECT id, color, group_id,
mode() WITHIN GROUP (ORDER BY color) OVER (PARTITION BY group_id)
FROM users;
I tried using a lateral join, but couldn't get it to work right without re-iterating my WHERE
clause both inside and outside the join (which I'd prefer not to do for when this query gets more complicated):
SELECT u1.id, u1.group_id, u1.color, mode_color
FROM users u1
LEFT JOIN LATERAL
(SELECT group_id, mode() WITHIN GROUP (ORDER BY color) as mode_color
FROM users
WHERE group_id = d1.group_id
GROUP BY group_id)
u2 ON u1.group_id = u2.group_id
WHERE u1.type = 'customer';
It's important that WHERE u1.type = 'customer'
stays outside of the subquery, as that's being appended to the query at a later point, after the first half of it is already written.
Window functions are permitted only in the SELECT list and the ORDER BY clause of the query. They are forbidden elsewhere, such as in GROUP BY , HAVING and WHERE clauses. This is because they logically execute after the processing of those clauses.
From PostgreSQL wiki. The statistical mode is the value that appears most often in a set of values.
Window functions are permitted only in the select list and ORDER BY clause. Query result rows are determined from the FROM clause, after WHERE , GROUP BY , and HAVING processing, and windowing execution occurs before ORDER BY , LIMIT , and SELECT DISTINCT . For details about each aggregate function, see Section 12.20.
We are talking about the ordered-set aggregate function mode(), introduced with Postgres 9.4. You probably saw this error message:
ERROR: OVER is not supported for ordered-set aggregate mode
We can work around it. But which mode exactly?
(All assuming group_id
and type
are NOT NULL
, else you need to do more.)
This computes the mode based on the filtered set (with type = 'customer'
) alone.
You get the most popular color per group among "customers".
A subquery in a plain JOIN
(without LEFT
and LATERAL
in this case) would do the job - calculating the mode once per group, not for every individual row:
SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM users u1
JOIN ( -- not LATERAL
SELECT group_id, type -- propagate out for the join
, mode() WITHIN GROUP (ORDER BY color) AS mode_color
FROM users
WHERE type = 'customer' -- place condition in subquery (cheap)
GROUP BY group_id, type
) u2 USING (group_id, type); -- shorthand syntax for matching names
-- WHERE type = 'customer' -- or filter later (expensive)
To avoid repeating your condition, place it in the subquery and propagate it to the outer query in the join clause - I picked matching column names and joined with USING
in my example.
You can move the condition to the outer query or even to a later step, yet. It will be needlessly more expensive, though, as the mode for every combination of (group_id, type)
has to be calculated, before the results for every other type is excluded in a later step.
There are ways to parameterize your query. Prepared statements, PL/pgSQL function, see:
Or, if the underlying table does not change much, a materialized view with all pre-computed modes per (group_id, type)
replacing the subquery would be an option.
One more option: use a CTE to filter qualifying rows first, then the WHERE
condition can stay outside of the subquery like you requested:
WITH cte AS ( -- filter result rows first
SELECT id, group_id, color
FROM users u1
WHERE type = 'customer' -- predicate goes here
)
SELECT *
FROM cte u1
LEFT JOIN ( -- or JOIN, doesn't matter here
SELECT group_id
, mode() WITHIN GROUP (ORDER BY color) AS mode_color
FROM cte -- based on only qualifying rows
GROUP BY 1
) u2 USING (group_id);
We can simplify with SELECT *
since USING
conveniently places only one group_id
in the result set.
If you want to base the mode on all rows (including those where type = 'customer'
is not true), you need a different query.
You get the most popular color per group among all members.
Move the WHERE
clause to the outer query:
SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM users u1
LEFT JOIN ( -- or JOIN, doesn't matter here
SELECT group_id
, mode() WITHIN GROUP (ORDER BY color) AS mode_color
FROM users
GROUP BY group_id
) u2 USING (group_id)
WHERE u1.type = 'customer';
If your predicate (type = 'customer'
) is selective enough, computing the mode for all groups may be a waste. Filter the small subset first and only compute the mode for contained groups. Add a CTE for this:
WITH cte AS ( -- filter result rows first
SELECT id, group_id, color
FROM users u1
WHERE type = 'customer'
)
SELECT *
FROM cte u1
LEFT JOIN ( -- or JOIN
SELECT group_id
, mode() WITHIN GROUP (ORDER BY color) AS mode_color
FROM (SELECT DISTINCT group_id FROM cte) g -- only relevant groups
JOIN users USING (group_id) -- but consider all rows for those
GROUP BY 1
) u2 USING (group_id);
Similar to the CTE query above, but based on all group members in the base table.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With