PostgreSQL MAX() function is an aggregate function that returns the maximum value in a set of values. Syntax: MAX(expression); The MAX() function can be used with SELECT, WHERE and HAVING clause.
We used the MAX() function within a subquery to find the maximum value, and returned the whole row with the outer query.
The SQL MIN() and MAX() Functions The MAX() function returns the largest value of the selected column.
PostgreSQL ROW_NUMBER() function and DISTINCT operator The reason is that the ROW_NUMBER() operates on the result set before the DISTINCT is applied. Or we can use a subquery in the FROM clause to get a list of unique price, and then apply the ROW_NUMBER() function in the outer query.
I would propose a clean version based on DISTINCT ON
(see docs):
SELECT DISTINCT ON (usr_id)
time_stamp,
lives_remaining,
usr_id,
trans_id
FROM lives
ORDER BY usr_id, time_stamp DESC, trans_id DESC;
On a table with 158k pseudo-random rows (usr_id uniformly distributed between 0 and 10k, trans_id
uniformly distributed between 0 and 30),
By query cost, below, I am referring to Postgres' cost based optimizer's cost estimate (with Postgres' default xxx_cost
values), which is a weighed function estimate of required I/O and CPU resources; you can obtain this by firing up PgAdminIII and running "Query/Explain (F7)" on the query with "Query/Explain options" set to "Analyze"
usr_id
, trans_id
, time_stamp
))usr_id
, trans_id
))usr_id
, trans_id
, time_stamp
))usr_id
, EXTRACT(EPOCH FROM time_stamp)
, trans_id
))
usr_id
, time_stamp
, trans_id
)); it has the advantage of scanning the lives
table only once and, should you temporarily increase (if needed) work_mem to accommodate the sort in memory, it will be by far the fastest of all queries.All times above include retrieval of the full 10k rows result-set.
Your goal is minimal cost estimate and minimal query execution time, with an emphasis on estimated cost. Query execution can dependent significantly on runtime conditions (e.g. whether relevant rows are already fully cached in memory or not), whereas the cost estimate is not. On the other hand, keep in mind that cost estimate is exactly that, an estimate.
The best query execution time is obtained when running on a dedicated database without load (e.g. playing with pgAdminIII on a development PC.) Query time will vary in production based on actual machine load/data access spread. When one query appears slightly faster (<20%) than the other but has a much higher cost, it will generally be wiser to choose the one with higher execution time but lower cost.
When you expect that there will be no competition for memory on your production machine at the time the query is run (e.g. the RDBMS cache and filesystem cache won't be thrashed by concurrent queries and/or filesystem activity) then the query time you obtained in standalone (e.g. pgAdminIII on a development PC) mode will be representative. If there is contention on the production system, query time will degrade proportionally to the estimated cost ratio, as the query with the lower cost does not rely as much on cache whereas the query with higher cost will revisit the same data over and over (triggering additional I/O in the absence of a stable cache), e.g.:
cost | time (dedicated machine) | time (under load) |
-------------------+--------------------------+-----------------------+
some query A: 5k | (all data cached) 900ms | (less i/o) 1000ms |
some query B: 50k | (all data cached) 900ms | (lots of i/o) 10000ms |
Do not forget to run ANALYZE lives
once after creating the necessary indices.
Query #1
-- incrementally narrow down the result set via inner joins
-- the CBO may elect to perform one full index scan combined
-- with cascading index lookups, or as hash aggregates terminated
-- by one nested index lookup into lives - on my machine
-- the latter query plan was selected given my memory settings and
-- histogram
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(time_stamp) AS time_stamp_max
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
l1.time_stamp = l2.time_stamp_max
INNER JOIN (
SELECT
usr_id,
time_stamp,
MAX(trans_id) AS trans_max
FROM
lives
GROUP BY
usr_id, time_stamp
) AS l3
ON
l1.usr_id = l3.usr_id AND
l1.time_stamp = l3.time_stamp AND
l1.trans_id = l3.trans_max
Query #2
-- cheat to obtain a max of the (time_stamp, trans_id) tuple in one pass
-- this results in a single table scan and one nested index lookup into lives,
-- by far the least I/O intensive operation even in case of great scarcity
-- of memory (least reliant on cache for the best performance)
SELECT
l1.*
FROM
lives AS l1
INNER JOIN (
SELECT
usr_id,
MAX(ARRAY[EXTRACT(EPOCH FROM time_stamp),trans_id])
AS compound_time_stamp
FROM
lives
GROUP BY
usr_id
) AS l2
ON
l1.usr_id = l2.usr_id AND
EXTRACT(EPOCH FROM l1.time_stamp) = l2.compound_time_stamp[1] AND
l1.trans_id = l2.compound_time_stamp[2]
2013/01/29 update
Finally, as of version 8.4, Postgres supports Window Function meaning you can write something as simple and efficient as:
Query #3
-- use Window Functions
-- performs a SINGLE scan of the table
SELECT DISTINCT ON (usr_id)
last_value(time_stamp) OVER wnd,
last_value(lives_remaining) OVER wnd,
usr_id,
last_value(trans_id) OVER wnd
FROM lives
WINDOW wnd AS (
PARTITION BY usr_id ORDER BY time_stamp, trans_id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
);
Here's another method, which happens to use no correlated subqueries or GROUP BY. I'm not expert in PostgreSQL performance tuning, so I suggest you try both this and the solutions given by other folks to see which works better for you.
SELECT l1.*
FROM lives l1 LEFT OUTER JOIN lives l2
ON (l1.usr_id = l2.usr_id AND (l1.time_stamp < l2.time_stamp
OR (l1.time_stamp = l2.time_stamp AND l1.trans_id < l2.trans_id)))
WHERE l2.usr_id IS NULL
ORDER BY l1.usr_id;
I am assuming that trans_id
is unique at least over any given value of time_stamp
.
I like the style of Mike Woodhouse's answer on the other page you mentioned. It's especially concise when the thing being maximised over is just a single column, in which case the subquery can just use MAX(some_col)
and GROUP BY
the other columns, but in your case you have a 2-part quantity to be maximised, you can still do so by using ORDER BY
plus LIMIT 1
instead (as done by Quassnoi):
SELECT *
FROM lives outer
WHERE (usr_id, time_stamp, trans_id) IN (
SELECT usr_id, time_stamp, trans_id
FROM lives sq
WHERE sq.usr_id = outer.usr_id
ORDER BY trans_id, time_stamp
LIMIT 1
)
I find using the row-constructor syntax WHERE (a, b, c) IN (subquery)
nice because it cuts down on the amount of verbiage needed.
There is a new option in Postgressql 9.5 called DISTINCT ON
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;
It eliminates duplicate rows an leaves only the first row as defined my the ORDER BY clause.
see the official documentation
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With