SELECTing newest entries for each item in a large dataset

Question

So, i have a table with roughly 1.5 million rows in it, looking somewhat like this:

name   | time       | data1 | data2  
--------------------------------------
 93-15 | 1337348782 |   11  | 60.791 
 92-02 | 1337348783 |   11  | 62.584 
 92-02 | 1337348056 |   11  | 63.281
 93-15 | 1337348068 |    8  | 65.849
 92-02 | 1337348117 |   11  | 63.271 
 93-15 | 1337348129 |    8  | 65.849 
 92-02 | 1337348176 |   10  | 63.258 
 93-15 | 1337348188 |    8  | 65.849 
 92-02 | 1337348238 |   10  | 63.245 
 93-15 | 1337348248 |    8  | 65.849

...these correspond to historical status updates from something that needs to be monitored. Now, what i would like to do is to find the current status if each unit.

It wasn't hard finding similar questions here on stackoverflow, and extrapolating from the findingsat, i came up with this query:

SELECT * FROM vehicles v
  JOIN ( SELECT  MAX(time) as max, name
    FROM vehicles
    GROUP BY name)
  m_v
ON (v.time = m_v.max AND v.name = m_v.name);

but seeing as i have roughly 1.5 million rows (and counting), is there a different approach that allows for a faster query?

MatBailie · Accepted Answer

WITH
  sequenced_data
AS
(
  SELECT
    ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC) AS sequence_id,
    *
  FROM
    vehicles
)
SELECT
  *
FROM
  sequenced_data
WHERE
  sequence_id = 1

A covering index on (name, time) would be helpful too.

EDIT: Notes on how it work, etc.

PostgreSQL has what are know as windowing or analytical functions. These generally take the form some_function() OVER (PARTITION BY some_fields ORDER BY some_fields).

In this case I used ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC).

ROW_NUMBER() creates a unique row number for a set of data. 1 to n for n records.

PARTITION BY name means that this function is applied independently to different names. Each name is it's own group/window/partition, and the results of ROW_NUMBER() start over from 1 again for each group/window/partition.

ORDER BY time DESC takes all the records in one group/window/partition and orders them by the time field, with the highest value first, before the ROW_NUMBER() function is applied.

For your example data, therefore, you get this...

 name  | time       | data1 | data2  | row_number
--------------------------------------------------

 92-02 | 1337348783 |   11  | 62.584 | 1
 92-02 | 1337348238 |   10  | 63.245 | 2
 92-02 | 1337348176 |   10  | 63.258 | 3
 92-02 | 1337348117 |   11  | 63.271 | 4
 92-02 | 1337348056 |   11  | 63.281 | 5

 93-15 | 1337348782 |   11  | 60.791 | 1
 93-15 | 1337348248 |    8  | 65.849 | 2
 93-15 | 1337348188 |    8  | 65.849 | 3
 93-15 | 1337348129 |    8  | 65.849 | 4
 93-15 | 1337348068 |    8  | 65.849 | 5

Because the ordering is time DESC, the highest valued time field, in each name group/window/partition, will always have a row_number of 1.

Having an index on (name, time) makes it much easier for the optimiser by ensuring the data is in a friendly order. This means that ROW_NUMBER() isn't actually applied to all the records; as soon as it finds the highest valued time record, and assigns ROW_NUMBER() = 1, it knows it can stop and move on to the next name.

SELECTing newest entries for each item in a large dataset

Tags:

performance

sql

postgresql

Jarmund

1 Answers

MatBailie

Recent Activity

Donate For Us

SELECTing newest entries for each item in a large dataset

Tags:

performance

sql

postgresql

Jarmund

1 Answers

MatBailie

Related questions

Recent Activity

Donate For Us