Say we have a table:
CREATE TABLE p
(
id serial NOT NULL,
val boolean NOT NULL,
PRIMARY KEY (id)
);
Populated with some rows:
insert into p (val)
values (true),(false),(false),(true),(true),(true),(false);
ID VAL 1 1 2 0 3 0 4 1 5 1 6 1 7 0
I want to determine when the value has been changed. So the result of my query should be:
ID VAL 2 0 4 1 7 0
I have a solution with joins and subqueries:
select min(id) id, val from
(
select p1.id, p1.val, max(p2.id) last_prev
from p p1
join p p2
on p2.id < p1.id and p2.val != p1.val
group by p1.id, p1.val
) tmp
group by val, last_prev
order by id;
But it is very inefficient and will work extremely slow for tables with many rows.
I believe there could be more efficient solution using PostgreSQL window functions?
SQL Fiddle
Some of the tricks we used to speed up SELECT-s in PostgreSQL: LEFT JOIN with redundant conditions, VALUES, extended statistics, primary key type conversion, CLUSTER, pg_hint_plan + bonus.
Yes the number of columns will - indirectly - influence the performance. The data in the columns will also affect the speed.
The most powerful tool at our disposal for understanding and optimizing SQL queries is EXPLAIN ANALYZE , which is a Postgres command that accepts a statement such as SELECT ... , UPDATE ... , or DELETE ... , executes the statement, and instead of returning the data provides a query plan detailing what approach the ...
Instead of calling COALESCE, you can provide a default from the window function lag() directly. A minor detail in this case since all columns are defined NOT NULL. But this may be essential to distinguish "no previous row" from "NULL in previous row".
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Compute the result of the comparison immediately, since the previous value is not of interest per se, only a possible change. Shorter and may be a tiny bit faster.
If you consider the first row to be "changed" (unlike your demo output suggests), you need to observe NULL values - even though your columns are defined NOT NULL. Basic lag() returns NULL in case there is no previous row:
SELECT id, val
FROM (
SELECT id, val, lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Or employ the additional parameters of lag() once again:
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
As proof of concept. :) Performance won't keep up with posted alternatives.
WITH RECURSIVE cte AS (
SELECT id, val
FROM p
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id < p.id
)
UNION ALL
SELECT p.id, p.val
FROM cte
JOIN p ON p.id > cte.id
AND p.val <> cte.val
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id > cte.id
AND p0.val <> cte.val
AND p0.id < p.id
)
)
SELECT * FROM cte;
With an improvement from @wildplasser.
SQL Fiddle demonstrating all.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With