Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL - column value changed - select query optimization

Say we have a table:

CREATE TABLE p
(
   id serial NOT NULL, 
   val boolean NOT NULL, 
   PRIMARY KEY (id)
);

Populated with some rows:

insert into p (val)
values (true),(false),(false),(true),(true),(true),(false);
ID  VAL
1   1
2   0
3   0
4   1
5   1
6   1
7   0

I want to determine when the value has been changed. So the result of my query should be:

ID  VAL
2   0
4   1
7   0

I have a solution with joins and subqueries:

select min(id) id, val from
(
  select p1.id, p1.val, max(p2.id) last_prev
  from p p1
  join p p2
    on p2.id < p1.id and p2.val != p1.val
  group by p1.id, p1.val
) tmp
group by val, last_prev
order by id;

But it is very inefficient and will work extremely slow for tables with many rows.
I believe there could be more efficient solution using PostgreSQL window functions?

SQL Fiddle

like image 621
Nailgun Avatar asked Jun 07 '14 15:06

Nailgun


People also ask

How make PostgreSQL query run faster?

Some of the tricks we used to speed up SELECT-s in PostgreSQL: LEFT JOIN with redundant conditions, VALUES, extended statistics, primary key type conversion, CLUSTER, pg_hint_plan + bonus.

Does number of columns affect performance in Postgres?

Yes the number of columns will - indirectly - influence the performance. The data in the columns will also affect the speed.

How do you analyze a query performance in PostgreSQL?

The most powerful tool at our disposal for understanding and optimizing SQL queries is EXPLAIN ANALYZE , which is a Postgres command that accepts a statement such as SELECT ... , UPDATE ... , or DELETE ... , executes the statement, and instead of returning the data provides a query plan detailing what approach the ...


1 Answers

Window function

Instead of calling COALESCE, you can provide a default from the window function lag() directly. A minor detail in this case since all columns are defined NOT NULL. But this may be essential to distinguish "no previous row" from "NULL in previous row".

SELECT id, val
FROM  (
   SELECT id, val, lag(val, 1, val) OVER (ORDER BY id) <> val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Compute the result of the comparison immediately, since the previous value is not of interest per se, only a possible change. Shorter and may be a tiny bit faster.

If you consider the first row to be "changed" (unlike your demo output suggests), you need to observe NULL values - even though your columns are defined NOT NULL. Basic lag() returns NULL in case there is no previous row:

SELECT id, val
FROM  (
   SELECT id, val, lag(val) OVER (ORDER BY id) IS DISTINCT FROM val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Or employ the additional parameters of lag() once again:

SELECT id, val
FROM  (
   SELECT id, val, lag(val, 1, NOT val) OVER (ORDER BY id) <> val AS changed
   FROM   p
   ) sub
WHERE  changed
ORDER  BY id;

Recursive CTE

As proof of concept. :) Performance won't keep up with posted alternatives.

WITH RECURSIVE cte AS (
   SELECT id, val
   FROM   p
   WHERE  NOT EXISTS (
      SELECT 1
      FROM   p p0
      WHERE  p0.id < p.id
      )
  
   UNION ALL
   SELECT p.id, p.val
   FROM   cte
   JOIN   p ON p.id   > cte.id
           AND p.val <> cte.val
   WHERE NOT EXISTS (
     SELECT 1
     FROM   p p0
     WHERE  p0.id   > cte.id
     AND    p0.val <> cte.val
     AND    p0.id   < p.id
     )
  )
SELECT * FROM cte;

With an improvement from @wildplasser.

SQL Fiddle demonstrating all.

like image 77
Erwin Brandstetter Avatar answered Sep 23 '22 00:09

Erwin Brandstetter