Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Return only the newest rows from a BigQuery table with a duplicate items

I have a table with many duplicate items – Many rows with the same id, perhaps with the only difference being a requested_at column.

I'd like to do a select * from the table, but only return one row with the same id – the most recently requested.

I've looked into group by id but then I need to do an aggregate for each column. This is easy with requested_at – max(requested_at) as requested_at – but the others are tough.

How do I make sure I get the value for title, etc that corresponds to that most recently updated row?

like image 806
Kevin Moore Avatar asked Dec 08 '15 20:12

Kevin Moore


People also ask

What is Upsert in BigQuery?

As mentioned earlier, BigQuery UPSERT means UPDATE+INSERT. It is essentially INSERT that behaves like an UPDATE in case of duplicates, so as to not violate any uniqueness constraints.

What is mode repeated in BigQuery?

A RECORD column can have REPEATED mode, which is represented as an array of STRUCT types. Also, a field within a record can be repeated, which is represented as a STRUCT that contains an ARRAY . An array cannot contain another array directly. For more information, see Declaring an ARRAY type.

How can I remove duplicate records from a table in SQL?

SQL Delete Duplicate Rows using Group By and Having Clause According to Delete Duplicate Rows in SQL, for finding duplicate rows, you need to use the SQL GROUP BY clause. The COUNT function can be used to verify the occurrence of a row using the Group by clause, which groups data according to the given columns.


2 Answers

I suggest a similar form that avoids a sort in the window function:

SELECT *
    FROM (
      SELECT
          *,
          MAX(<timestamp_column>)
              OVER (PARTITION BY <id_column>)
              AS max_timestamp,
      FROM <table>
    )
    WHERE <timestamp_column> = max_timestamp
like image 129
Matthew Wesley Avatar answered Jan 04 '23 05:01

Matthew Wesley


Try something like this:

    SELECT *
    FROM (
      SELECT
          *,
          ROW_NUMBER()
              OVER (
                  PARTITION BY <id_column>
                  ORDER BY <timestamp column> DESC)
              row_number,
      FROM <table>
    )
    WHERE row_number = 1

Note it will add a row_number column, which you might not want. To fix this, you can select individual columns by name in the outer select statement.

In your case, it sounds like the requested_at column is the one you want to use in the ORDER BY.

And, you will also want to use allow_large_results, set a destination table, and specify no flattening of results (if you have a schema with repeated fields).

like image 23
Jordan Tigani Avatar answered Jan 04 '23 04:01

Jordan Tigani