There are plenty of great posts on SQL that selects unique rows and write (truncates) a table so the dus are removed. e.g
WITH ev AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
FROM `duplicates`
)
SELECT
* EXCEPT(rowNum)
FROM
ev
WHERE rowNum = 1
I was trying to explore this slightly differently using DML and DELETE
(e.g if you don't want to use a BQ savedQuery, just execute SQL). What I want to do is roughly:
WITH dup_events AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
FROM `duplicates`
)
DELETE FROM
dup_events
WHERE rowNum > 1
but got this error in the console:
Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [10:1]
Can this be achieved (standardSQL) using DELETE
?
The go to solution for removing duplicate rows from your result sets is to include the distinct keyword in your select statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique.
SQL Delete Duplicate Rows using Group By and Having Clause According to Delete Duplicate Rows in SQL, for finding duplicate rows, you need to use the SQL GROUP BY clause. The COUNT function can be used to verify the occurrence of a row using the Group by clause, which groups data according to the given columns.
DELETE statement allows you to delete rows from a table. When using a DELETE statement, you must use WHERE clause followed by a condition. DELETE FROM `project. dataset.
Below actually :o) works
#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
SELECT AS STRUCT id, MAX(loadTime) loadTime
FROM `yourproject.yourdataset.duplicates`
GROUP BY id)
Note: it assumes that loadTime is also unique - meaning if for given id there are more than one record with latest loadTime - they all will be preserved
Those answers above works only for small size table. If you have a large size partition table, and only want to remove duplicates in a given range, use the SQL below:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrage_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
SELECT k.*
FROM (
SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
FROM `gcp_project`.`data_set`.`the_table` AS original_data
WHERE stamp BETWEEN dt_start AND dt_end
GROUP BY surrogate_key
)
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partiion range
THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
This has to be the easiest way:
create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)
If you have an array data type, try this:
-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 2, 3] as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname,ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);
-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)] col
FROM DW.pmoTest tbl
GROUP BY ID
)
);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With