BigQuery: Deleting Duplicates in Partitioned Table

Tags: google-bigquery, bigquery-standard-sql

I have a BigQuery table that is partitioned by insert time, and I'm trying to remove duplicates from it. These are true duplicates: for any two duplicate rows, all columns are equal. Of course, having a unique key would have been helpful :-(

At first I tried a SELECT query to enumerate duplicates and remove them:

SELECT
    * EXCEPT(row_number)
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id_column) AS row_number
    FROM
        `mytable`)
WHERE
    row_number = 1

This returns unique rows, but saving the result creates a new table that doesn't include the partition data, so it's not good.

I've seen an answer which states that the only way to retain partitions is to go over them one by one with the above query and save each result to a specific target table partition.

What I'd really like to do is use a DML DELETE to remove the duplicate rows in place. I tried something similar to what another answer suggested:

DELETE
FROM `mytable` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id_column)
   FROM `mytable` AS d2
   WHERE d.id_column = d2.id_column) > 1;

But the accepted answer doesn't work and results in a BQ error:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN

It would be great if anyone could offer a simpler way (DML or otherwise) to deal with this, so that I'm not required to loop over all partitions individually.

asked Dec 06 '18 11:12

Shai Ben-Tovim

1 Answer

Kind of a hack, but you can use the MERGE statement to delete all of the contents of the table and reinsert only distinct rows atomically. Here's an example:

-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));
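To see how many copies of each row exist, a grouped count can help. This is only a sketch against the example table above; it assumes the columns are exactly x, y and date as created there. Since every x in 1..10 is cross-joined with a second 10-element array, each distinct row should appear 10 times.

```sql
-- List every distinct row that occurs more than once,
-- together with its number of copies.
SELECT
  x,
  y,
  date,
  COUNT(*) AS copies
FROM dataset.PartitionedTable
GROUP BY x, y, date
HAVING COUNT(*) > 1
ORDER BY date;
```

The same query can be rerun after any deduplication step; an empty result means no exact duplicates remain.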

Now for the MERGE part:

-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE
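If rewriting the whole table is too expensive, the same pattern can be limited to a range of partitions. The following is a sketch, not part of the original answer: the 3-day window is an arbitrary assumption, and the filter in the source subquery must match the condition on the DELETE clause exactly, otherwise rows outside the window could be deleted without being reinserted. Whether this actually reduces the bytes scanned depends on partition pruning, so it is worth checking with a dry run first.

```sql
-- Deduplicate only the partitions from the last 3 days (assumed window).
-- Rows outside the window are neither deleted nor reinserted.
MERGE dataset.PartitionedTable AS t1
USING (
  SELECT DISTINCT *
  FROM dataset.PartitionedTable
  WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
) AS t2
ON FALSE
-- Every source row is "not matched by target" because of ON FALSE,
-- so each distinct row in the window is inserted once.
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
-- Every target row is "not matched by source"; the extra condition
-- restricts the delete to the same window as the source filter.
WHEN NOT MATCHED BY SOURCE
  AND t1.date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) THEN DELETE;
```

Note that the example table is partitioned by a regular DATE column. For a table partitioned by insert time, as in the question, INSERT ROW does not carry the original partition along, so the partition pseudo-column would probably have to be selected and inserted explicitly.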

answered Oct 19 '22 23:10

Elliott Brossard
