
How to get mode() in a window function in Postgres?

I'm trying to get the mode() for a grouped data set, but without grouping the results. (Using Postgres 9.5, can upgrade if needed.)

E.g., users have a 'favorite color' and belong to a single group. Get a list of users with the mode() 'favorite color' within their group.

A window function would work for most aggregates, but mode() seems to be an exception that isn't compatible with window functions. Is there another way to go about this? Here's what I've been toying with so far ...

Works, but gives grouped results; I'm looking for the results to be ungrouped:

SELECT group_id, 
    mode() WITHIN GROUP (ORDER BY color)
FROM users
GROUP BY group_id;

Invalid syntax (just an example of what I'm trying to accomplish):

SELECT id, color, group_id, 
    mode(color) OVER (PARTITION BY group_id)
FROM users;

Or:

SELECT id, color, group_id, 
    mode() WITHIN GROUP (ORDER BY color) OVER (PARTITION BY group_id)
FROM users;

I tried using a lateral join, but couldn't get it to work right without repeating my WHERE clause both inside and outside the join (which I'd prefer not to do, since this query is going to get more complicated):

SELECT u1.id, u1.group_id, u1.color, mode_color
FROM users u1
LEFT JOIN LATERAL
    (SELECT group_id, mode() WITHIN GROUP (ORDER BY color) as mode_color
     FROM users
     WHERE group_id = u1.group_id
     GROUP BY group_id)
    u2 ON u1.group_id = u2.group_id
WHERE u1.type = 'customer';

It's important that WHERE u1.type = 'customer' stays outside of the subquery, as that's being appended to the query at a later point, after the first half of it is already written.

asked Apr 05 '19 19:04

PeanutsMcgee

1 Answer

We are talking about the ordered-set aggregate function mode(), introduced with Postgres 9.4. You probably saw this error message:

ERROR:  OVER is not supported for ordered-set aggregate mode

We can work around it. But which mode exactly?

(All assuming group_id and type are NOT NULL, else you need to do more.)
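
Which rows feed the aggregate changes the answer. A minimal sketch with made-up sample data (the table definition and the rows are assumptions for illustration; only the column names come from the question), showing that the mode among customers can differ from the mode among all group members:

```sql
-- Hypothetical sample data. A temporary table hides any regular table
-- named "users" for the rest of the session and vanishes at session end.
CREATE TEMP TABLE users (
   id       int  PRIMARY KEY
 , group_id int  NOT NULL
 , type     text NOT NULL
 , color    text
);

INSERT INTO users (id, group_id, type, color) VALUES
   (1, 1, 'customer', 'red')
 , (2, 1, 'customer', 'red')
 , (3, 1, 'staff'   , 'blue')
 , (4, 1, 'staff'   , 'blue')
 , (5, 1, 'staff'   , 'blue')
 , (6, 2, 'customer', 'green')
 , (7, 2, 'staff'   , 'green')
;

-- Mode among customers only: group 1 -> red, group 2 -> green
SELECT group_id, mode() WITHIN GROUP (ORDER BY color) AS mode_color
FROM   users
WHERE  type = 'customer'
GROUP  BY group_id;

-- Mode among all members: group 1 -> blue, group 2 -> green
SELECT group_id, mode() WITHIN GROUP (ORDER BY color) AS mode_color
FROM   users
GROUP  BY group_id;
```

Group 1 is the interesting case here: customers favor red, but the group as a whole favors blue.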

Mode of qualifying rows

This computes the mode based on the filtered set (with type = 'customer') alone.
You get the most popular color per group among "customers".

A subquery in a plain JOIN (without LEFT and LATERAL in this case) would do the job - calculating the mode once per group, not for every individual row:

SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM   users u1
JOIN  (                            -- not LATERAL
   SELECT group_id, type           -- propagate out for the join
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   users 
   WHERE  type = 'customer'        -- place condition in subquery (cheap)
   GROUP  BY group_id, type
   ) u2 USING (group_id, type);    -- shorthand syntax for matching names
-- WHERE  type = 'customer'        -- or filter later (expensive)

To avoid repeating your condition, place it in the subquery and propagate it to the outer query in the join clause - I picked matching column names and joined with USING in my example.

You can still move the condition to the outer query, or even to a later step. It will be needlessly more expensive, though, as the mode for every combination of (group_id, type) has to be calculated before the results for every other type are excluded in a later step.

There are ways to parameterize your query: prepared statements or a PL/pgSQL function. See:

Split given string and prepare case statement
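
As a rough illustration of the prepared-statement route, here is a sketch that passes the type as a parameter, so the predicate is written only once. It assumes the users table from the question already exists; the statement name users_with_mode is made up for this example, and the parameter type is left for Postgres to infer from the column.

```sql
-- Assumes a table users(id, group_id, type, color) as in the question.
-- "users_with_mode" is a hypothetical name.
PREPARE users_with_mode AS
SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM   users u1
JOIN  (
   SELECT group_id, type
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   users
   WHERE  type = $1                -- the only place the predicate appears
   GROUP  BY group_id, type
   ) u2 USING (group_id, type);

-- Supply the value at execution time
EXECUTE users_with_mode('customer');
```

A prepared statement only lives for the duration of the session, so this suits an application that keeps its connection open and runs the query repeatedly.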

Or, if the underlying table does not change much, a materialized view with all pre-computed modes per (group_id, type) replacing the subquery would be an option.
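
A sketch of what that could look like, assuming a regular (non-temporary) users table as in the question; the view name user_mode_color is made up for this example. Since the modes are pre-computed, the predicate can sit in the outer query without the cost discussed above:

```sql
-- Assumes a regular table users(id, group_id, type, color);
-- materialized views cannot be built on temporary tables.
-- "user_mode_color" is a hypothetical name.
CREATE MATERIALIZED VIEW user_mode_color AS
SELECT group_id, type
     , mode() WITHIN GROUP (ORDER BY color) AS mode_color
FROM   users
GROUP  BY group_id, type;

-- Optional: speeds up lookups and is a prerequisite for
-- REFRESH MATERIALIZED VIEW CONCURRENTLY
CREATE UNIQUE INDEX ON user_mode_color (group_id, type);

-- The materialized view replaces the subquery
SELECT u1.id, u1.group_id, u1.color, m.mode_color
FROM   users u1
JOIN   user_mode_color m USING (group_id, type)
WHERE  u1.type = 'customer';       -- predicate stays in the outer query

-- The view does not update itself; refresh after the table changes
REFRESH MATERIALIZED VIEW user_mode_color;
```

Between refreshes the view can return stale modes, which is why this only fits tables that change rarely.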

One more option: use a CTE to filter qualifying rows first, then the WHERE condition can stay outside of the subquery like you requested:

WITH cte AS (  -- filter result rows first
   SELECT id, group_id, color
   FROM   users u1
   WHERE  type = 'customer'        -- predicate goes here
   )
SELECT *
FROM   cte u1
LEFT   JOIN (                      -- or JOIN, doesn't matter here
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   cte                      -- based on only qualifying rows
   GROUP  BY 1
   ) u2 USING (group_id);

We can simplify with SELECT * since USING conveniently places only one group_id in the result set.

Mode of all rows

If you want to base the mode on all rows (including those where type = 'customer' is not true), you need a different query.
You get the most popular color per group among all members.

Move the WHERE clause to the outer query:

SELECT u1.id, u1.group_id, u1.color, u2.mode_color
FROM   users u1
LEFT   JOIN (                      -- or JOIN, doesn't matter here
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM   users
   GROUP  BY group_id
   ) u2 USING (group_id)
WHERE  u1.type = 'customer';

If your predicate (type = 'customer') is selective enough, computing the mode for all groups may be a waste. Filter the small subset first and only compute the mode for the groups it contains. Add a CTE for this:

WITH cte AS (  -- filter result rows first
   SELECT id, group_id, color
   FROM   users u1
   WHERE  type = 'customer'
   )
SELECT *
FROM   cte u1
LEFT   JOIN (        -- or JOIN
   SELECT group_id
        , mode() WITHIN GROUP (ORDER BY color) AS mode_color
   FROM  (SELECT DISTINCT group_id FROM cte) g  -- only relevant groups
   JOIN   users USING (group_id)                -- but consider all rows for those
   GROUP  BY 1
   ) u2 USING (group_id);

Similar to the CTE query above, but based on all group members in the base table.

answered Oct 09 '22 19:10

Erwin Brandstetter
