I need to take the first N rows for each group, ordered by a custom column.
Given the following table:
db=# SELECT * FROM xxx;
 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  3 |          1 | C
  4 |          1 | D
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
  8 |          2 | H
(8 rows)
I need the first 2 rows (ordered by name) for each section_id, i.e. a result similar to:
 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
(5 rows)
I am using PostgreSQL 8.3.5.
New solution (PostgreSQL 8.4 and later), using a window function:
SELECT
*
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS r,
t.*
FROM
xxx t) x
WHERE
x.r <= 2;
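The query above also returns the helper column r, and rows with equal names are numbered in an unspecified order. A minimal variant, assuming id is unique and is an acceptable tie-breaker, lists the original columns explicitly and sorts the output:

```sql
-- Sketch: same window-function approach, with two assumptions added:
--   1. id is unique, so (name, id) gives a deterministic order within a section
--   2. only the original columns are wanted in the output (r is dropped)
SELECT x.id, x.section_id, x.name
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name, id) AS r
    FROM xxx t
) x
WHERE x.r <= 2
ORDER BY x.section_id, x.name, x.id;
```

On the sample table from the question this returns the five rows listed as the desired result.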
Since v9.3 you can use a lateral join:
select distinct t_outer.section_id, t_top.id, t_top.name from t t_outer
join lateral (
select * from t t_inner
where t_inner.section_id = t_outer.section_id
order by t_inner.name
limit 2
) t_top on true
order by t_outer.section_id;
It might be faster but, of course, you should test performance specifically on your data and use case.
Here's another solution (PostgreSQL <= 8.3).
SELECT
*
FROM
xxx a
WHERE (
SELECT
COUNT(*)
FROM
xxx
WHERE
section_id = a.section_id
AND
name <= a.name
) <= 2;
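Note that the counting query above relies on name being unique within each section_id: if two rows in a section share a name, both are counted for each other, and a section can end up with fewer than 2 rows. A sketch of a tie-safe variant, assuming id is unique and can serve as the tie-breaker:

```sql
-- Sketch for PostgreSQL <= 8.3 (no window functions).
-- Assumption: id is unique, so (name, id) is a strict ordering.
-- For each row a, count the rows of the same section that sort
-- at or before a; that count is a's rank within its section.
SELECT a.*
FROM xxx a
WHERE (
    SELECT COUNT(*)
    FROM xxx b
    WHERE b.section_id = a.section_id
      AND (b.name < a.name
           OR (b.name = a.name AND b.id <= a.id))
) <= 2
ORDER BY a.section_id, a.name, a.id;
```

Like the original, this evaluates the subquery once per row, so it is likely to scale poorly on large groups.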
SELECT x.*
FROM (
SELECT section_id,
COALESCE
(
(
SELECT xi
FROM xxx xi
WHERE xi.section_id = xo.section_id
ORDER BY
name, id
OFFSET 1 LIMIT 1
),
(
SELECT xi
FROM xxx xi
WHERE xi.section_id = xo.section_id
ORDER BY
name DESC, id DESC
LIMIT 1
)
) AS mlast
FROM (
SELECT DISTINCT section_id
FROM xxx
) xo
) xoo
JOIN xxx x
ON x.section_id = xoo.section_id
AND (x.name, x.id) <= ((mlast).name, (mlast).id);
-- ranking without WINDOW functions
-- EXPLAIN ANALYZE
WITH rnk AS (
SELECT x1.id
, COUNT(x2.id) AS rnk
FROM xxx x1
LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
GROUP BY x1.id
)
SELECT this.*
FROM xxx this
JOIN rnk ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
;
-- The same without using a CTE
-- EXPLAIN ANALYZE
SELECT this.*
FROM xxx this
JOIN ( SELECT x1.id
, COUNT(x2.id) AS rnk
FROM xxx x1
LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
GROUP BY x1.id
) rnk
ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
;
A lateral join is the way to go, but on large tables you should first select the distinct groups in a nested query to improve performance.
SELECT t_limited.*
FROM (
SELECT DISTINCT section_id
FROM t
) t_groups
JOIN LATERAL (
SELECT *
FROM t t_all
WHERE t_all.section_id = t_groups.section_id
ORDER BY t_all.name
LIMIT 2
) t_limited ON true;
Without the nested SELECT DISTINCT, the lateral subquery runs once for every row in the table, even though the same section_id appears in many rows. With the nested SELECT DISTINCT, it runs exactly once for each distinct section_id.
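Since the lateral subquery filters on section_id and orders by name, a composite index on those two columns may let the planner answer each per-group lookup with a short index scan instead of sorting the whole group. The following is a sketch only: the index name is hypothetical, and whether the planner actually uses the index depends on your data and PostgreSQL version, so compare the plans before and after creating it.

```sql
-- Hypothetical index name; the column order (section_id, name)
-- matches the WHERE and ORDER BY of the lateral subquery.
CREATE INDEX t_section_id_name_idx ON t (section_id, name);

-- Inspect the actual plan and timing (run with and without the index).
EXPLAIN ANALYZE
SELECT t_limited.*
FROM (
    SELECT DISTINCT section_id
    FROM t
) t_groups
JOIN LATERAL (
    SELECT *
    FROM t t_all
    WHERE t_all.section_id = t_groups.section_id
    ORDER BY t_all.name
    LIMIT 2
) t_limited ON true;
```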