I have a table with many duplicate items: many rows share the same id, perhaps differing only in a requested_at column. I'd like to do a select * from the table but return only one row per id, the most recently requested one. I've looked into group by id, but then I need an aggregate for each column. This is easy for requested_at (max(requested_at) as requested_at), but the others are tough. How do I make sure I get the value for title, etc. that corresponds to that most recently requested row?

Return only the newest rows from a BigQuery table with duplicate items

2 Answers

I suggest a similar form that avoids a sort in the window function:

    SELECT *
    FROM (
      SELECT
          *,
          MAX(<timestamp_column>)
              OVER (PARTITION BY <id_column>)
              AS max_timestamp,
      FROM <table>
    )
    WHERE <timestamp_column> = max_timestamp
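
As a concrete sketch of the query above, the snippet below runs the same MAX() OVER (PARTITION BY ...) shape against an in-memory SQLite database through Python's sqlite3 module, purely so it can be tried locally without a BigQuery project. The table name requests, its columns, the sample rows, and the helper newest_rows_by_max are all hypothetical, and it assumes a SQLite build with window-function support (3.25 or later). One behaviour worth keeping in mind: if two rows share both the id and the latest timestamp, this form returns both of them.

```python
import sqlite3

# Hypothetical sample data: (id, title, requested_at). Not from the original post.
ROWS = [
    (1, "old title", "2023-01-01 10:00:00"),
    (1, "new title", "2023-01-03 09:30:00"),
    (2, "only title", "2023-01-02 12:00:00"),
    (3, "tie a", "2023-01-04 08:00:00"),
    (3, "tie b", "2023-01-04 08:00:00"),
]

# Same shape as the answer's query, with concrete (assumed) names filled in.
# The outer SELECT lists columns so the helper max_timestamp is not returned.
QUERY = """
SELECT id, title, requested_at
FROM (
  SELECT
      *,
      MAX(requested_at) OVER (PARTITION BY id) AS max_timestamp
  FROM requests
)
WHERE requested_at = max_timestamp
ORDER BY id, title
"""


def newest_rows_by_max(rows):
    """Load rows into a throwaway table and keep those matching the per-id maximum."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE requests (id INTEGER, title TEXT, requested_at TEXT)"
        )
        conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in newest_rows_by_max(ROWS):
        print(row)
```

With this sample data, id 1 keeps only its newer row, while id 3 keeps both rows because they tie on requested_at.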


answered Jan 04 '23 05:01

Matthew Wesley

Try something like this:

    SELECT *
    FROM (
      SELECT
          *,
          ROW_NUMBER()
              OVER (
                  PARTITION BY <id_column>
                  ORDER BY <timestamp_column> DESC)
              row_number,
      FROM <table>
    )
    WHERE row_number = 1

Note that this adds a row_number column to the output, which you might not want. To leave it out, list the columns you need by name in the outer SELECT instead of using SELECT *.

In your case, it sounds like the requested_at column is the one you want to use in the ORDER BY.
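
To make that concrete, here is a minimal sketch with requested_at in the ORDER BY and the outer SELECT naming its columns so that the helper column is not returned. It runs against an in-memory SQLite database via Python's sqlite3 module only so it can be tried locally; the table name requests, the sample rows, the alias rn, and the helper newest_rows_by_row_number are hypothetical, and a SQLite build with window functions (3.25 or later) is assumed.

```python
import sqlite3

# Hypothetical sample data: (id, title, requested_at). Not from the original post.
ROWS = [
    (1, "old title", "2023-01-01 10:00:00"),
    (1, "new title", "2023-01-03 09:30:00"),
    (2, "only title", "2023-01-02 12:00:00"),
    (3, "draft", "2023-01-02 08:00:00"),
    (3, "final", "2023-01-05 08:00:00"),
]

# The inner query numbers each id's rows from newest to oldest;
# the outer query keeps row 1 and names its columns so rn is dropped.
QUERY = """
SELECT id, title, requested_at
FROM (
  SELECT
      *,
      ROW_NUMBER() OVER (
          PARTITION BY id
          ORDER BY requested_at DESC
      ) AS rn
  FROM requests
)
WHERE rn = 1
ORDER BY id
"""


def newest_rows_by_row_number(rows):
    """Return (column_names, result_rows) with one newest row per id."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE requests (id INTEGER, title TEXT, requested_at TEXT)"
        )
        conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)
        cursor = conn.execute(QUERY)
        columns = [description[0] for description in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    columns, result = newest_rows_by_row_number(ROWS)
    print(columns)
    for row in result:
        print(row)
```

Unlike a filter on the maximum timestamp, ROW_NUMBER() always yields exactly one row per id; when two rows tie on requested_at, which of them gets number 1 is not determined unless a tie-breaking column is added to the ORDER BY.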

You will also want to enable allow_large_results, set a destination table, and turn off flattening of results (if your schema has repeated fields).

answered Jan 04 '23 04:01

Jordan Tigani


Tags:

google-bigquery

Kevin Moore
