Say we have a table:
CREATE TABLE p
(
id serial NOT NULL,
val boolean NOT NULL,
PRIMARY KEY (id)
);
Populated with some rows:
insert into p (val)
values (true),(false),(false),(true),(true),(true),(false);
ID VAL 1 1 2 0 3 0 4 1 5 1 6 1 7 0
I want to determine when the value has been changed. So the result of my query should be:
ID VAL 2 0 4 1 7 0
I have a solution with joins and subqueries:
select min(id) id, val from
(
select p1.id, p1.val, max(p2.id) last_prev
from p p1
join p p2
on p2.id < p1.id and p2.val != p1.val
group by p1.id, p1.val
) tmp
group by val, last_prev
order by id;
But it is very inefficient and will work extremely slow for tables with many rows.
I believe there could be more efficient solution using PostgreSQL window functions?
SQL Fiddle
Some of the tricks we used to speed up SELECT-s in PostgreSQL: LEFT JOIN with redundant conditions, VALUES, extended statistics, primary key type conversion, CLUSTER, pg_hint_plan + bonus.
Yes the number of columns will - indirectly - influence the performance. The data in the columns will also affect the speed.
The most powerful tool at our disposal for understanding and optimizing SQL queries is EXPLAIN ANALYZE , which is a Postgres command that accepts a statement such as SELECT ... , UPDATE ... , or DELETE ... , executes the statement, and instead of returning the data provides a query plan detailing what approach the ...
Instead of calling COALESCE
, you can provide a default from the window function lag()
directly. A minor detail in this case since all columns are defined NOT NULL
. But this may be essential to distinguish "no previous row" from "NULL in previous row".
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Compute the result of the comparison immediately, since the previous value is not of interest per se, only a possible change. Shorter and may be a tiny bit faster.
If you consider the first row to be "changed" (unlike your demo output suggests), you need to observe NULL
values - even though your columns are defined NOT NULL
. Basic lag()
returns NULL
in case there is no previous row:
SELECT id, val
FROM (
SELECT id, val, lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
Or employ the additional parameters of lag()
once again:
SELECT id, val
FROM (
SELECT id, val, lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
FROM p
) sub
WHERE changed
ORDER BY id;
As proof of concept. :) Performance won't keep up with posted alternatives.
WITH RECURSIVE cte AS (
SELECT id, val
FROM p
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id < p.id
)
UNION ALL
SELECT p.id, p.val
FROM cte
JOIN p ON p.id > cte.id
AND p.val <> cte.val
WHERE NOT EXISTS (
SELECT 1
FROM p p0
WHERE p0.id > cte.id
AND p0.val <> cte.val
AND p0.id < p.id
)
)
SELECT * FROM cte;
With an improvement from @wildplasser.
SQL Fiddle demonstrating all.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With