I have a quite large table with a field ID
and another field as collection_time
. I want to select latest record for each ID. Unfortunately combination of (ID, collection_time)
time is not unique together in my data. I want just one of records with the maximum collection time
. I have tried two solutions but none of them has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time) as rn
FROM mytable) where rn=1
This results in Resources exceeded
error that I guess is because of ORDER BY
in the query.
Second Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
this solution does not work for me because (ID, collection_time)
are not unique together so in JOIN
result there would be multiple rows for each ID
.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
ARRAY_AGG. Returns an ARRAY of expression values. To learn more about the optional arguments in this function and how to use them, see Aggregate function calls.
BigQuery STRING_AGG function returns a value by gathering all expressions in a row and generating a concatenated string of non-null values. The value can be either STRING or BYTES. The result generally depends on the first expression taken by the BigQuery STRING_AGG function.
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job for you and is scalable considering the fact that the schema keeps changing, you won't have to change this
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With