Scalable Solution to get latest row for each ID in BigQuery

Tags:

sql

google-bigquery

I have a fairly large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data, and I want just one of the records with the maximum collection_time. I have tried two solutions, but neither has worked for me:

First, using a ROW_NUMBER() query:

SELECT *
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) AS rn
  FROM mytable)
WHERE rn = 1

This results in a Resources exceeded error, which I guess is caused by the ORDER BY in the query.

Second, using a join between the table and the latest time per ID:

(SELECT tab1.* 
FROM mytable AS tab1
INNER JOIN EACH 
(SELECT ID, MAX(collection_time) AS second_time 
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)

This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result can contain multiple rows for a single ID.
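To make the tie problem concrete, here is a minimal sketch in BigQuery standard SQL (so without the legacy EACH keywords). The inline rows and the payload column are made up purely for illustration and are not from the real table.

```sql
-- Hypothetical sample data: ID 1 has two rows sharing the latest collection_time.
WITH mytable AS (
  SELECT 1 AS ID, TIMESTAMP '2016-08-01 10:00:00' AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, TIMESTAMP '2016-08-02 10:00:00', 'b' UNION ALL
  SELECT 1, TIMESTAMP '2016-08-02 10:00:00', 'c'
)
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
  -- Latest collection_time per ID
  SELECT ID, MAX(collection_time) AS second_time
  FROM mytable
  GROUP BY ID
) AS tab2
ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.second_time
```

Both the 'b' and 'c' rows match the maximum time, so the join returns two rows for ID 1 instead of one.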

I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?


asked Aug 28 '16 05:08

S.Mohsen sh

1 Answer

SELECT
  agg.table.*
FROM (
  SELECT
    id,
    ARRAY_AGG(STRUCT(table) ORDER BY collection_time DESC)[SAFE_OFFSET(0)] agg
  FROM
    `dataset.table` table
  GROUP BY
    id)

This will do the job for you. It also copes with a schema that keeps changing: because the whole row is aggregated as a STRUCT, you won't have to change the query when columns are added or removed.
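A commonly used variation on the query above adds LIMIT 1 inside ARRAY_AGG, so that each group only needs to keep its top row rather than the full sorted array. The sketch below assumes BigQuery standard SQL; the inline mytable rows and the payload column are hypothetical sample data, so swap in your real table for the WITH clause.

```sql
-- Hypothetical sample data; ID 1 has a tie on the latest collection_time.
WITH mytable AS (
  SELECT 1 AS ID, TIMESTAMP '2016-08-01 10:00:00' AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, TIMESTAMP '2016-08-02 10:00:00', 'b' UNION ALL
  SELECT 1, TIMESTAMP '2016-08-02 10:00:00', 'c' UNION ALL
  SELECT 2, TIMESTAMP '2016-08-03 09:00:00', 'd'
)
SELECT latest.*
FROM (
  SELECT
    -- Aggregate whole rows, newest first, and keep only the first one per ID.
    ARRAY_AGG(t ORDER BY collection_time DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM mytable AS t
  GROUP BY ID
)
```

This returns exactly one row per ID even when several rows share the maximum collection_time. Which of the tied rows is kept is not guaranteed; if that matters, add a tie-breaking column to the ORDER BY (for example payload in this made-up data).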


answered Nov 07 '22 04:11

Mit Parekh
