Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance of max() vs ORDER BY DESC + LIMIT 1

I was troubleshooting a few slow SQL queries today and don't quite understand the performance difference below:

When trying to extract the max(timestamp) from a data table based on some condition, using MAX() is slower than ORDER BY timestamp LIMIT 1 if a matching row exists, but considerably faster if no matching row is found.

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4
ORDER BY timestamp DESC
LIMIT 1;
(0 rows)  
Time: 1314.544 ms

SELECT timestamp
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5
ORDER BY timestamp DESC
LIMIT 1;
(1 row)  
Time: 10.890 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 4;
(0 rows)
Time: 0.869 ms

SELECT MAX(timestamp)
FROM data JOIN sensors ON ( sensors.id = data.sensor_id )
WHERE sensor.station_id = 5;
(1 row)
Time: 84.087 ms 

There are indexes on (timestamp) and (sensor_id, timestamp), and I noticed that Postgres uses very different query plans and indexes for both cases:

QUERY PLAN (ORDER BY)                                              
--------------------------------------------------------------------------------------------------------
Limit  (cost=0.43..9.47 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..396254.63 rows=43823 width=8)
          Join Filter: (data.sensor_id = sensors.id)
          ->  Index Scan using timestamp_ind on data  (cost=0.43..254918.66 rows=4710976 width=12)
          ->  Materialize  (cost=0.00..6.70 rows=2 width=4)
              ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
                  Filter: (station_id = 4)
(7 rows)

QUERY PLAN (MAX)                                               
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3680.59..3680.60 rows=1 width=8)
    ->  Nested Loop  (cost=0.43..3571.03 rows=43823 width=8)
        ->  Seq Scan on sensors  (cost=0.00..6.69 rows=2 width=4)
              Filter: (station_id = 4)
        ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12)
              Index Cond: (sensor_id = sensors.id)
(6 rows)

So my two questions are:

  1. Where does this performance difference come from? I've seen the accepted answer here MIN/MAX vs ORDER BY and LIMIT, but that doesn't quite seem to apply here. Any good resources would be appreciated.
  2. Are there better ways to increase performance in all cases (matching row vs no matching row) than adding an EXISTS check?

EDIT to address the questions in the comments below. I kept the initial query plans above for future reference:

Table definitions:

                                  Table "public.sensors"
        Column        |          Type          |                            Modifiers                            
----------------------+------------------------+-----------------------------------------------------------------
id                    | integer                | not null default nextval('sensors_id_seq'::regclass)
station_id            | integer                | not null
....

Indexes:
    "sensor_primary" PRIMARY KEY, btree (id)
    "ind_station_id" btree (station_id, id)
    "ind_station" btree (station_id)

                                  Table "public.data"
  Column   |           Type           |                            Modifiers                             
-----------+--------------------------+------------------------------------------------------------------
 id        | integer                  | not null default nextval('data_id_seq'::regclass)
 timestamp | timestamp with time zone | not null
 sensor_id | integer                  | not null
 avg       | integer                  |

Indexes:
    "timestamp_ind" btree ("timestamp" DESC)
    "sensor_ind" btree (sensor_id)
    "sensor_ind_timestamp" btree (sensor_id, "timestamp")
    "sensor_ind_timestamp_desc" btree (sensor_id, "timestamp" DESC)

Note that I added ind_station_id on sensors just now after @Erwin's suggestion below. Timings haven't really changed drastically, still >1200ms in the ORDER BY DESC + LIMIT 1 case and ~0.9ms in the MAX case.

Query Plans:

QUERY PLAN (ORDER BY)
----------------------------------------------------------------------------------------------------------
Limit  (cost=0.58..9.62 rows=1 width=8) (actual time=2161.054..2161.054 rows=0 loops=1)
  Buffers: shared hit=3418066 read=47326
  ->  Nested Loop  (cost=0.58..396382.45 rows=43823 width=8) (actual time=2161.053..2161.053 rows=0 loops=1)
        Join Filter: (data.sensor_id = sensors.id)
        Buffers: shared hit=3418066 read=47326
        ->  Index Scan using timestamp_ind on data  (cost=0.43..255048.99 rows=4710976 width=12) (actual time=0.047..1410.715 rows=4710976 loops=1)
              Buffers: shared hit=3418065 read=47326
        ->  Materialize  (cost=0.14..4.19 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=4710976)
              Buffers: shared hit=1
              ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.004..0.004 rows=0 loops=1)
                    Index Cond: (station_id = 4)
                    Heap Fetches: 0
                    Buffers: shared hit=1
Planning time: 0.478 ms
Execution time: 2161.090 ms
(15 rows)

QUERY (MAX)
----------------------------------------------------------------------------------------------------------
Aggregate  (cost=3678.08..3678.09 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=1)
   Buffers: shared hit=1
   ->  Nested Loop  (cost=0.58..3568.52 rows=43823 width=8) (actual time=0.006..0.006 rows=0 loops=1)
         Buffers: shared hit=1
         ->  Index Only Scan using ind_station_id on sensors  (cost=0.14..4.18 rows=2 width=4) (actual time=0.005..0.005 rows=0 loops=1)
               Index Cond: (station_id = 4)
               Heap Fetches: 0
               Buffers: shared hit=1
         ->  Index Only Scan using sensor_ind_timestamp on data  (cost=0.43..1389.59 rows=39258 width=12) (never executed)
               Index Cond: (sensor_id = sensors.id)
               Heap Fetches: 0
 Planning time: 0.435 ms
 Execution time: 0.048 ms
 (13 rows)

So just like in the earlier explains, ORDER BY does a Scan using timestamp_in on data, which is not done in the MAX case.

Postgres version: Postgres from the Ubuntu repos: PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 5.2.1-21ubuntu2) 5.2.1 20151003, 64-bit

Note that there are NOT NULL constraints in place, so ORDER BY won't have to sort over empty rows.

Note also that I'm largely interested in where the difference comes from. While not ideal, I can retrieve data relatively quickly using EXISTS (<1ms) and then SELECT (~11ms).

like image 807
Geotob Avatar asked Dec 12 '15 23:12

Geotob


People also ask

Can we use max with group by?

MySQL MAX() function with GROUP BY retrieves maximum value of an expression which has undergone a grouping operation (usually based upon one column or a list of comma-separated columns).

Can we use Max on a column?

Expression made up of a single constant, variable, scalar function, or column name or any combination of arithmetic, bitwise, and string operators. MAX can be used with numeric, character, and datetime columns, but not with bit columns.

Can you use Max on a string?

MAX() with String ColumnThe MAX() function can be used on the string column.

Can we use order by and limit in SQL?

ORDER BY LIMIT is used to get rows from table in sorting order either in ascending or descending order and to limit rows in result-set. ORDER BY LIMIT is not supported in all databases. ORDER BY LIMIT works only in MySQL.


1 Answers

There does not seem to be an index on sensor.station_id, which is most probably important here.

There is an actual difference between max() and ORDER BY DESC + LIMIT 1. Many people seem to miss that. NULL values sort first in descending sort order. So ORDER BY timestamp DESC LIMIT 1 returns a row with timestamp IS NULL if it exists, while the aggregate function max() ignores NULL values and returns the latest non-null timestamp.

For your case, since your column d.timestamp is defined NOT NULL (as your update revealed), there is no effective difference. An index with DESC NULLS LAST and the same clause in the ORDER BY for the LIMIT query should still serve you best. I suggest these indexes (my query below builds on the 2nd one):

sensor(station_id, id)
data(sensor_id, timestamp DESC NULLS LAST)

You can drop the other index variants sensor_ind_timestamp and sensor_ind_timestamp_desc unless you have other queries that still require them (unlikely, but possible).

Much more importantly, there is another difficulty: The filter on the first table sensors returns few, but still (possibly) multiple rows. Postgres expects to find 2 rows (rows=2) in your added EXPLAIN output.
The perfect technique would be a loose index scan for the second table data - which is not currently implemented in Postgres 9.4 (or Postgres 9.5). You can rewrite the query to work around this limitation in various ways. Details:

  • Optimize GROUP BY query to retrieve latest record per user

The best should be:

SELECT d.timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4
ORDER  BY d.timestamp DESC NULLS LAST
LIMIT  1;

Since the style of outer query is mostly irrelevant, you can also just:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ORDER  BY timestamp DESC NULLS LAST
   LIMIT  1
   ) d
WHERE  s.station_id = 4;

And the max() variant should perform about as fast now:

SELECT max(d.timestamp) AS timestamp
FROM   sensors s
CROSS  JOIN LATERAL  (
   SELECT max(timestamp) AS timestamp
   FROM   data
   WHERE  sensor_id = s.id
   ) d
WHERE  s.station_id = 4;

Or even, shortest of all:

SELECT max((SELECT max(timestamp) FROM data WHERE sensor_id = s.id)) AS timestamp
FROM   sensors s
WHERE  station_id = 4;

Note the double parentheses!

The additional advantage of LIMIT in a LATERAL join is that you can retrieve arbitrary columns of the selected row, not just the latest timestamp (one column).

Related:

  • Why do NULL values come first when ordering DESC in a PostgreSQL query?
  • What is the difference between LATERAL and a subquery in PostgreSQL?
  • Select first row in each GROUP BY group?
  • Optimize groupwise maximum query
like image 53
Erwin Brandstetter Avatar answered Sep 18 '22 18:09

Erwin Brandstetter