Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with an ID field and another field, collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data, and I want just one of the records with the maximum collection time. I have tried two solutions, but neither has worked for me:

First, using ROW_NUMBER():

SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) AS rn  -- DESC so rn = 1 is the latest row per ID
  FROM mytable
)
WHERE rn = 1

This results in a Resources exceeded error, which I guess is caused by the ORDER BY in the window function.

Second, using a join between the table and the latest time per ID:

SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH (
  SELECT ID, MAX(collection_time) AS second_time
  FROM mytable
  GROUP EACH BY ID
) AS tab2
ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.second_time

This solution does not work for me because (ID, collection_time) is not unique, so the join result contains multiple rows for some IDs.
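
To see the failure mode concretely, here is a minimal sketch (standard SQL, with hypothetical inline data of my own, not from the question) where two rows tie on the maximum collection_time, so the join keeps both:

-- Hypothetical data: two rows for ID 1 share the maximum collection_time,
-- so joining back on (ID, MAX(collection_time)) returns both of them.
WITH mytable AS (
  SELECT 1 AS ID, TIMESTAMP '2016-08-28 05:00:00' AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, TIMESTAMP '2016-08-28 05:00:00', 'b' UNION ALL
  SELECT 2, TIMESTAMP '2016-08-27 12:00:00', 'c'
)
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
  SELECT ID, MAX(collection_time) AS max_time
  FROM mytable
  GROUP BY ID
) AS tab2
ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.max_time
-- ID 1 comes back twice ('a' and 'b') instead of once.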

Is there a workaround for the resourcesExceeded error, or a different query that would work in my case?

Asked Aug 28 '16 by S.Mohsen sh

People also ask

What does Array_agg do in BigQuery?

ARRAY_AGG returns an ARRAY of expression values. To learn more about its optional arguments (such as ORDER BY and LIMIT) and how to use them, see the BigQuery documentation on aggregate function calls.
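
For instance, a minimal sketch with hypothetical inline data (the fruits table below is mine, for illustration only):

-- ARRAY_AGG collects the values of each group into a single array.
WITH fruits AS (
  SELECT 'apple' AS name, 1 AS basket UNION ALL
  SELECT 'pear', 1 UNION ALL
  SELECT 'plum', 2
)
SELECT basket, ARRAY_AGG(name ORDER BY name) AS names
FROM fruits
GROUP BY basket
-- basket 1 -> [apple, pear]; basket 2 -> [plum]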

What is String_agg in BigQuery?

The BigQuery STRING_AGG function gathers the values of an expression across rows and returns a concatenated string of the non-NULL values, separated by an optional delimiter. The input can be either STRING or BYTES, and the result type matches the type of the input expression.
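
A minimal sketch, again with hypothetical inline data of my own:

-- STRING_AGG skips NULLs and joins the remaining values with the given delimiter.
WITH t AS (
  SELECT 'red' AS color UNION ALL
  SELECT NULL UNION ALL
  SELECT 'blue'
)
SELECT STRING_AGG(color, ', ') AS colors  -- 'red, blue'
FROM t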


1 Answer

SELECT
  agg.table.*  -- expand the saved row back into its original columns
FROM (
  SELECT
    id,
    -- Collect each ID's rows as STRUCTs, newest first, and keep only the first one.
    ARRAY_AGG(STRUCT(table) ORDER BY collection_time DESC)[SAFE_OFFSET(0)] AS agg
  FROM
    `dataset.table` table
  GROUP BY
    id)

This will do the job for you, and it is scalable: because it selects the whole row as a STRUCT, you won't have to change the query when the schema changes.
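
If a single ID can have a very large number of rows, a common refinement (my addition, not part of the original answer) is to add LIMIT 1 inside ARRAY_AGG so BigQuery only has to keep one element per group in memory:

-- ARRAY_AGG keeps at most one element per ID thanks to LIMIT 1,
-- which bounds the memory needed for each group.
SELECT latest.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY collection_time DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM `dataset.table` t
  GROUP BY id
)

Here [OFFSET(0)] is safe without the SAFE_ prefix because every group contains at least one row.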

Answered Nov 07 '22 by Mit Parekh