
BigQuery - DELETE statement to remove duplicates

There are plenty of great posts on SQL that selects unique rows and writes (truncates) a table so the dupes are removed, e.g.:

WITH ev AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
)
SELECT
  * EXCEPT(rowNum)
FROM
  ev
WHERE rowNum = 1

I wanted to try a slightly different approach using DML and DELETE (e.g. if you don't want to use a BQ saved query and just want to execute SQL). What I want to do is roughly:

WITH dup_events AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
)
DELETE FROM
  dup_events
WHERE rowNum > 1

but got this error in the console:

Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [10:1]

Can this be achieved (standardSQL) using DELETE?

Kurt Maile asked Jan 06 '18 18:01



3 Answers

The statement below actually works :o)

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
        SELECT AS STRUCT id, MAX(loadTime) loadTime 
        FROM `yourproject.yourdataset.duplicates` 
        GROUP BY id)  

Note: this assumes that loadTime is unique per id. If a given id has more than one record with the latest loadTime, all of those records are preserved.
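To check whether the assumption in the note above holds before running the DELETE, a query along these lines may help. It is only a sketch that reuses the table and column names from the answer; an empty result suggests that no id has several rows sharing its latest loadTime.

```sql
#standardSQL
-- Sketch: list ids whose latest loadTime is shared by more than one row.
-- Those rows would all survive the DELETE above.
SELECT id, loadTime, COUNT(*) AS row_count
FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) IN (
        SELECT AS STRUCT id, MAX(loadTime) loadTime
        FROM `yourproject.yourdataset.duplicates`
        GROUP BY id)
GROUP BY id, loadTime
HAVING COUNT(*) > 1
```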

Mikhail Berlyant answered Oct 19 '22 01:10


The answers above only work for small tables. If you have a large partitioned table and only want to remove duplicates in a given range, use the SQL below:

-- WARNING: back up the table before this operation
-- For a large timestamp-partitioned table
-- -------------------------------------------
-- De-duplicate the rows in a given range of a partitioned table, using surrogate_key as the unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )

) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in the partition range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW

credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
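Because the join condition is ON FALSE, no destination row ever matches a source row: every destination row in the range is deleted, and every de-duplicated source row is inserted. One way to sanity-check the result, sketched here with the same table, columns, and range as the script above, is to compare the row count with the number of distinct keys in the range. The two numbers should be equal once the MERGE has run.

```sql
-- Sketch: run as a separate script after the MERGE, using the same range.
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

SELECT
  COUNT(*) AS total_rows,                        -- all rows in the range
  COUNT(DISTINCT surrogate_key) AS distinct_keys -- one per unique key
FROM `gcp_project`.`data_set`.`the_table`
WHERE stamp BETWEEN dt_start AND dt_end;
```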

Hui Zheng answered Oct 19 '22 00:10


This has to be the easiest way:

create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)

SELECT DISTINCT cannot return ARRAY columns, so if your table has an array data type, try this instead:

-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 2, 3]  as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname,ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);

-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
  SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)]  col
  FROM DW.pmoTest tbl
  GROUP BY ID
  )
);

PetArbiter answered Oct 19 '22 00:10