Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get mode() in a window function in Postgres?

Tags:

I'm try to get the mode() for a grouped data set, but without grouping the results. (Using Postgres 9.5, can upgrade if needed.)

e.g. Users have a 'favorite color', and belong to a single group. Get list of users with the mode() 'favorite color' within their group.

A window function would work for most aggregates, but mode() seems to be an exception that isn't compatible with window functions. Is there another way to go about this? Here's what I've been toying with so far ...

Works but gives grouped results, I'm looking for the results to be ungrouped:

SELECT group_id, 
    mode() WITHIN GROUP (ORDER BY color)
FROM users
GROUP BY group_id;

Invalid syntax (just an example of what I'm trying to accomplish):

SELECT id, color, group_id, 
    mode(color) OVER (PARTITION BY group_id)
FROM users;

Or:

SELECT id, color, group_id, 
    mode() WITHIN GROUP (ORDER BY color) OVER (PARTITION BY group_id)
FROM users;

I tried using a lateral join, but couldn't get it to work right without re-iterating my WHERE clause both inside and outside the join (which I'd prefer not to do for when this query gets more complicated):

SELECT u1.id, u1.group_id, u1.color, mode_color
FROM users u1
LEFT JOIN LATERAL
    (SELECT group_id, mode() WITHIN GROUP (ORDER BY color) as mode_color
     FROM users
     WHERE group_id = d1.group_id
     GROUP BY group_id)
    u2 ON u1.group_id = u2.group_id
WHERE u1.type = 'customer';

It's important that WHERE u1.type = 'customer' stays outside of the subquery, as that's being appended to the query at a later point, after the first half of it is already written.

like image 310
PeanutsMcgee Avatar asked Apr 05 '19 19:04

PeanutsMcgee


People also ask

Does window function work in PostgreSQL?

Window functions are permitted only in the SELECT list and the ORDER BY clause of the query. They are forbidden elsewhere, such as in GROUP BY , HAVING and WHERE clauses. This is because they logically execute after the processing of those clauses.

What is mode in PostgreSQL?

From PostgreSQL wiki. The statistical mode is the value that appears most often in a set of values.

Can we use window function with where clause?

Window functions are permitted only in the select list and ORDER BY clause. Query result rows are determined from the FROM clause, after WHERE , GROUP BY , and HAVING processing, and windowing execution occurs before ORDER BY , LIMIT , and SELECT DISTINCT . For details about each aggregate function, see Section 12.20.


1 Answers

We are talking about the ordered-set aggregate function mode(), introduced with Postgres 9.4. You probably saw this error message:

ERROR:  OVER is not supported for ordered-set aggregate mode

We can work around it. But which mode exactly?

(All assuming group_id and type are NOT NULL, else you need to do more.)

Mode of qualifying rows

This computes the mode based on the filtered set (with type = 'customer') alone.
You get the most popular color per group among "customers".

A subquery in a plain JOIN (without LEFT and LATERAL in this case) would do the job - calculating the mode once per group, not for every individual row:

SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM   users u1
JOIN  (                            -- not LATERAL
   SELECT group_id, type           -- propagate out for the join
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   users 
   WHERE  type = 'customer'        -- place condition in subquery (cheap)
   GROUP  BY group_id, type
   ) u2 USING (group_id, type);    -- shorthand syntax for matching names
-- WHERE  type = 'customer'        -- or filter later (expensive)

To avoid repeating your condition, place it in the subquery and propagate it to the outer query in the join clause - I picked matching column names and joined with USING in my example.

You can move the condition to the outer query or even to a later step, yet. It will be needlessly more expensive, though, as the mode for every combination of (group_id, type) has to be calculated, before the results for every other type is excluded in a later step.

There are ways to parameterize your query. Prepared statements, PL/pgSQL function, see:

  • Split given string and prepare case statement

Or, if the underlying table does not change much, a materialized view with all pre-computed modes per (group_id, type) replacing the subquery would be an option.

One more option: use a CTE to filter qualifying rows first, then the WHERE condition can stay outside of the subquery like you requested:

WITH cte AS (  -- filter result rows first
   SELECT id, group_id, color
   FROM   users u1
   WHERE  type = 'customer'        -- predicate goes here
   )
SELECT *
FROM   cte u1
LEFT   JOIN (                      -- or JOIN, doesn't matter here
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   cte                      -- based on only qualifying rows
   GROUP  BY 1
   ) u2 USING (group_id);

We can simplify with SELECT * since USING conveniently places only one group_id in the result set.

Mode of all rows

If you want to base the mode on all rows (including those where type = 'customer' is not true), you need a different query.
You get the most popular color per group among all members.

Move the WHERE clause to the outer query:

SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM   users u1
LEFT   JOIN (                      -- or JOIN, doesn't matter here
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   users
   GROUP  BY group_id
   ) u2 USING (group_id)
WHERE  u1.type = 'customer';

If your predicate (type = 'customer') is selective enough, computing the mode for all groups may be a waste. Filter the small subset first and only compute the mode for contained groups. Add a CTE for this:

WITH cte AS (  -- filter result rows first
   SELECT id, group_id, color
   FROM   users u1
   WHERE  type = 'customer'
   )
SELECT *
FROM   cte u1
LEFT   JOIN (        -- or JOIN
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM  (SELECT DISTINCT group_id FROM cte) g  -- only relevant groups
   JOIN   users USING (group_id)                -- but consider all rows for those
   GROUP  BY 1
   ) u2 USING (group_id);

Similar to the CTE query above, but based on all group members in the base table.

like image 192
Erwin Brandstetter Avatar answered Oct 09 '22 19:10

Erwin Brandstetter