I have a BQ table that is partitioned by insert time. I'm trying to remove duplicates from the table. These are true duplicates: for any 2 duplicate rows, all columns are equal. Of course, having a unique key might have been helpful :-(
At first I tried a SELECT query to enumerate duplicates and remove them:
SELECT
* EXCEPT(row_number)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id_column) row_number
FROM
`mytable`)
WHERE
row_number = 1
This results in unique rows but creates a new table that doesn't include the partition data - so not good.
I've seen this answer here which states the only way to retain partitions is to go over them one-by-one with the above query and save to a specific target table partition.
What I'd really want to do is use a DML DELETE to remove the duplicate rows in place. I tried something similar to what this answer suggested:
DELETE
FROM `mytable` AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id_column)
FROM `mytable` AS d2
WHERE d.id = d2.id) > 1;
But the accepted answer doesn't work and results in a BQ error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
Would be great if anyone could offer a simpler (DML or otherwise) way to deal with this so I won't be required to loop over all partitions individually.
You can always overwrite a single partition of a partitioned table in BQ by adding a partition decorator (a $YYYYMMDD suffix) to the output table name of your query and using WRITE_TRUNCATE as your write disposition (i.e. truncate whatever exists in that partition and write the new results).
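As a rough sketch of what that per-partition query could look like, assuming the ingestion-time partitioned table from the question (`mytable`) and a hypothetical partition date of 2018-01-01: the destination table (`mytable$20180101`) and the WRITE_TRUNCATE disposition are job settings, so they do not appear in the SQL itself.

```sql
-- Deduplicate a single partition of an ingestion-time partitioned table.
-- Run this with the destination table set to `mytable$20180101` and the
-- write disposition set to WRITE_TRUNCATE (job settings, not SQL).
-- The date is a hypothetical example; substitute the partition you need.
SELECT DISTINCT *
FROM `mytable`
WHERE _PARTITIONTIME = TIMESTAMP('2018-01-01');
```

You would still have to repeat this once per partition, which is exactly the loop the question is trying to avoid.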
When you create a partitioned table, you can require that all queries on the table must include a predicate filter (a WHERE clause) that filters on the partitioning column. This setting can improve performance and reduce costs, because BigQuery can use the filter to prune partitions that don't match the predicate.
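For illustration, here is a minimal sketch of turning that setting on for an existing table and then querying it with a filter on the partitioning column. It assumes the `dataset.PartitionedTable` example table used in the answer below, which is partitioned on a `date` column.

```sql
-- Require a partition filter on every query against the table.
ALTER TABLE dataset.PartitionedTable
SET OPTIONS (require_partition_filter = TRUE);

-- This query is accepted because it filters on the partitioning column,
-- so BigQuery can prune every partition except yesterday's.
SELECT x, y
FROM dataset.PartitionedTable
WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```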
Kind of a hack, but you can use the MERGE statement to delete all of the contents of the table and reinsert only distinct rows atomically. Here's an example:
-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));
Now for the MERGE part:
-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE
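To check the result, one option is to compare the total row count with the distinct row count, and to look at the per-partition counts. This is only a sketch against the example table above; the column aliases are my own.

```sql
-- If the two counts match, no exact duplicate rows remain.
SELECT
  (SELECT COUNT(*) FROM dataset.PartitionedTable) AS total_rows,
  (SELECT COUNT(*)
   FROM (SELECT DISTINCT * FROM dataset.PartitionedTable)) AS distinct_rows;

-- Confirm that rows are still spread across the date partitions.
SELECT date, COUNT(*) AS rows_in_partition
FROM dataset.PartitionedTable
GROUP BY date
ORDER BY date;
```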